INTRODUCTION

At the beginning, this study started out to be strictly an empirical
analysis of available data to obtain some better insight into the
psychology and mechanics of guessing. As the data began to unfold,
however, it was apparent that far more was involved than guessing
alone. Indeed, the entire fabric of test-utilization behavior was in
question. Many questions about the real reasons for testing under
different circumstances very quickly came to mind.

New emphasis on item analysis as a means of interpreting test results
reinforced the idea that the answers to multiple choice questions were
too casually being equated to actual work-sample performance. It was
clear that the functioning of any item was more closely related to its
apparent ease or difficulty for the test-taking group than we had
sensed in the past.

It was also realized that some subject matter was far more readily
adaptable to objective-type test items than other types of content.
Finally, a little thought convinced this writer that modification of
the item types being used could practically eliminate the guessing
factor, without making the test impossible to score electronically,
while yielding substantial amounts of data not now being obtained.

However, the first priority in the analysis of the available data is
still the investigation of GUESSING behavior, how it can best be
eliminated or how it can be counteracted, and how the general attitude
toward testing on the part of teachers and pupils can be improved with
the consequent improvement of the educational process - especially
with educationally handicapped children.

The study of the effect of guessing in objective-type examinations has
been a concern of the test-makers and publishers almost from the
beginning of the effort to construct such tests.

For example, the first achievement test battery to be published and
widely used throughout this country was the Stanford Achievement Test,
Forms A and B, Copyright 1923. These tests, covering grades 2-8,
included a wide variety of items, all rather steeply graded with
respect to difficulty.

The Primary Battery of this series consisted of certain items from the
beginning part of the Advanced Battery, so there was no actual
differentiation in content between the Primary Battery and the
Advanced Battery. In other words, the so-called Primary Battery was
not truly a primary grades battery in the current sense of the word.

In several of the subtests in this pioneering test series, items which
were of the multiple choice type were used, and where this was the
case a correction for guessing was indicated in the scoring
directions - although the Manual of Directions contained no specific
instructions about the effect of guessing or not guessing and no
instructions to guess or not to guess.

More specifically, the score for the Reading: Sentence Meaning Test
was indicated to be number right minus number wrong. The pupil
directions for this test read as follows:

"Read the first sentence at the top of the page. It says: 'Can dogs
bark? Yes No.' The right answer is 'Yes,' so the word Yes has a line
under it.

"Look at the second sentence (slowly). 'Does a cat have six legs?
Yes No.' This time the correct answer is 'No,' so the word No has a
line under it. Now you must read each question on this page and draw a
line under the right answer. Ready? Go."

Note that no indication was given to the students that the score would
be rights minus wrongs, as stipulated in the scoring directions.

A similar correction for guessing was employed in Test 6: Nature Study
and Science, which used three-choice multiple choice questions, the
directions for scoring saying simply that the score was number right
minus one-half the number wrong. A comparable correction was used in
Test 7: History and Literature. In Test 8: Language Usage, the score
once more was number right minus number wrong. All other tests in this
battery were of such a type as not to allow for a correction for
guessing.
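The two scoring rules quoted above are instances of the standard
correction, in which the number wrong is divided by one less than the
number of choices. The following is a minimal sketch of that
arithmetic, not the publisher's scoring procedure; the function name
and the sample counts are hypothetical and serve only as illustration.

```python
from fractions import Fraction


def correct_for_guessing(rights, wrongs, n_choices):
    """Standard correction: rights minus wrongs / (n_choices - 1).

    Omitted items are simply not counted. The name and signature are
    illustrative assumptions, not taken from any test manual.
    """
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    return Fraction(rights) - Fraction(wrongs, n_choices - 1)


# Two-choice (Yes/No) items: score is number right minus number wrong.
two_choice = correct_for_guessing(rights=30, wrongs=10, n_choices=2)

# Three-choice items: score is number right minus one-half the number wrong.
three_choice = correct_for_guessing(rights=30, wrongs=10, n_choices=3)

print(two_choice, three_choice)
```

Exact fractions are used so that a half-point penalty is not blurred
by rounding before the score is reported.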

Dr. Giles M. Ruch was the junior author of the Stanford Achievement
Test, along with Lewis M. Terman and Truman L. Kelley - both very well
known educational psychologists. Giles M. Ruch seems to be the one in
this authorship team who was particularly interested in the problem of
guessing and its effect on reliability and validity.

In a book by Ruch that must be considered a classic in testing
literature, entitled The Objective or New-Type Examination, 1/ there
is a section, "Part III: Experimental and Theoretical Considerations",
which reviews several studies concerning "guessing" as a test-taking
behavior and also the total impact of the correction for guessing. The
general conclusion of the research studies reported in this chapter,
including some by Ruch (with others), seems to be that the correction
for guessing of the standard type does, indeed, increase the
reliability and the validity of the tests somewhat in most instances -
although the gains are not great. However, no indication is given that
corrected and uncorrected scores correlate 1.00 if all items are
answered, as was subsequently realized.

This experimental section also reports on the effect of increasing the
number of choices up to a maximum of seven and suggests that five is
probably the optimal number of choices to use where sensible
alternatives can be found.

More recently, studies in the area of correction for guessing have
been comprehensively reviewed and summarized in an article by James
Diamond and William Evans published in the Review of Educational
Research, Spring 1973. 2/ In the summary of this article, immediately
prior to the comprehensive listing of references, this sentence
appears:

"By way of summary, one might note that the standard correction for
guessing implies only one model of test-taking behavior. Perhaps new,
computer oriented weighting procedures will allow us to expand the
model and to consider other factors in test scoring, guessing,
reliability and validity."

The bibliography included with this article under the various
subtopics is comprehensive as to time span and, in view of its length,
indicates the continued concern with the problem of guessing up to the
present moment - as suggested by the reference to new computer
oriented procedures which will allow for weighting of test items to
"correct for guessing."



1/ Ruch, Giles M. The Objective or New-Type Examination, Scott,
Foresman and Co., 1929.

2/ Diamond, James, and William Evans. "The Correction for Guessing."
Review of Educational Research, 1973, 43, 181.



The study with which this report is concerned differs essentially from
any of the studies reported so far in the literature - as nearly as
can be discovered by a superficial review of the titles in the
Diamond-Evans bibliography and some personal investigations of the
author.

It has been customary to note in the most recent textbooks and other
authoritative sources that the correction for guessing is totally
ineffective when all items in the test are attempted, since the
corrected scores and the uncorrected scores will have a correlation of
1.00.
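The reason is simple algebra. If every one of n items is attempted,
the number wrong is n minus the number right, so the corrected score
R - (n - R)/(k - 1) equals (kR - n)/(k - 1), a straight-line function
of R with positive slope. The sketch below checks this numerically.
The item count, number of choices, and scores are invented for
illustration and are not data from this study.

```python
import math


def corrected_score(rights, n_items, n_choices):
    """Corrected score when all items are attempted (wrongs = n_items - rights)."""
    wrongs = n_items - rights
    return rights - wrongs / (n_choices - 1)


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical test: 40 four-choice items, every pupil answers every item.
N_ITEMS = 40
N_CHOICES = 4
rights = [12, 17, 21, 21, 26, 30, 35, 38]

corrected = [corrected_score(r, N_ITEMS, N_CHOICES) for r in rights]
r_value = pearson_r(rights, corrected)
print(r_value)
```

Because the transformation is linear, the rank order of pupils is also
unchanged; the correction can only reorder pupils when the number of
omitted items differs from pupil to pupil.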

For the benefit of those for whom this truth is not self-evident,
Chart I-1 shows a bivariate of the actual scores of 62 cases
attempting all items on one test versus the corrected scores for these
same children. Note that the only variation in rank order is due to
rounding off the corrected scores.

The article by Diamond and Evans in the review mentioned above
indicates that under certain circumstances, other than the attempting
of all items, the same phenomenon is true. In any case, the pursuit of
a mathematical correction for guessing based on number right versus
number wrong seems to be pretty much a lost cause; consequently, we
must have some way of approaching the problem quite differently.

One obvious way would be to find a different approach to the
identification of the child who actually guesses as compared to the
person who has partial knowledge or has extensive knowledge and
answers most of the questions from a basis of information exceeding
that of his peers. To anticipate some developments which will be
described later on, it is evident to the writer that finding any
mathematical solution to this problem, at this time and under present
circumstances, is quite unlikely.

The other long-term approach is to devise a way of testing which will
be essentially free of guessing and, therefore, will cause the problem
of guessing to disappear.

Obviously if one uses a work-sample method of testing, in which the
child does the thing on which he is supposed to be measured, then
guessing is nullified - since he must perform the very task he is
expected to perform in real life.

The best example is, perhaps, in arithmetic - where in a computation
situation, such as one calling for the multiplication of two two-place
numbers, the child actually works out the problem and records his
answer - possibly transferring the answer to a marginal answer space
for ease in hand-scoring; but he does not come up with a response that
is scorable by electronic methods.

Chart I-1

A Demonstration Bivariate to Show the One-to-One Relationship
Between Corrected and Uncorrected Scores When All
Items Are Attempted

[Chart not reproduced. Recoverable axis label: Actual Score.]

The work-sample approach is, of course, older than standardized
testing (as old as education, in fact). The first Stanford Achievement
Battery, namely Form A published in 1923, used the work-sample
approach in a number of instances, one of which was in the Spelling
Test. In this test a paragraph was dictated to the child, all of which
was recorded in writing by him. Only certain words in the paragraph
were considered relevant in regard to their correct, or incorrect,
spelling. Obviously, this was a time-consuming test to give and
difficult to score - since the words had to be found in the context of
the child's writing and then be evaluated for correctness of spelling.
In the next edition, this method of measuring spelling was dropped for
a different approach.

While some ingenuity might increase the efficiency of the operation,
it seems reasonably clear that the answer to getting rid of guessing
is not to go over entirely to the work-sample approach but to do
something to change the child's attitude toward testing in general and
guessing in particular, while at the same time making it much more
difficult to guess and still get a correct response.

A positive and logical approach to this problem is to analyze the
types of content to be measured and then try to devise new item types
which will satisfy the criteria mentioned above, while at the same
time changing the attitude of teachers and pupils toward the
administration and interpretation of the test results.

This proposal is a major task and this report is obviously not the
place in which to discuss it in great detail. Suffice it to say now
that the writer is convinced that the task is not an impossible one
and before concluding this report he will attempt to indicate ways in
which giant steps can be taken to effect this desirable goal.

The Purpose of the Study

Initially, the purpose of this study was to reveal by an intensive
analysis of certain available data the extent to which guessing really
existed and the nature of the groups who were most inclined to guess
as a way of responding to the test situation. Any other findings were
thought of as being more or less secondary.

Along with the identification of the children who guessed went the
almost equally serious problem of constructing a test that could be
given at the beginning of the year and at the end of the year with
meaningful analysis of the differences between the two testing
periods.

The author has been working exhaustively on this problem, which is
really not so much statistical as it is logistical. The data from the
study clearly show that something must be done to replace the usual
achievement battery for the purpose of a before-after type of testing
over short periods.

Because of certain shortcomings, which will be developed more fully at
a later time, the usual procedure for selecting standardized
achievement test items will not work and an alternative procedure must
be found.

The Available Data

In the 1969-70 school year, the State Department of Education in New
Hampshire conducted a statewide testing program involving the Stanford
Achievement Test. The inclusive testing program covered grades 2, 4, 6
and 8, but this report is concerned only with grade 4 - in which
Intermediate I Battery: Form X was used.

It was further stipulated that the same form of the test given in
October to all pupils in the state would be given over again in the
spring to pupils identified as being in Title I projects. The spring
testing of these Title I children was supposed to provide the data in
terms of which to evaluate the effectiveness of the instruction in the
Title I projects as compared to the normal amount of growth during
this period of time.

Typically, this "normal amount of growth" is expressed in terms of
months of grade equivalent and it has been considered satisfactory to
set up some criterion, such as growth of one school year during the
seven-month period, as being the expectancy in a successful program.

Without getting into all of the complexities involved, grade
equivalents are totally unsuitable for this purpose and always have
been. They are based upon testing over a period of twelve months even
though the amount of gain from one testing period at a given grade,
such as grade 4, to the next testing period, at grade 5, is twelve
months. The fact of summer forgetting is neglected totally, and it is
assumed that a month of grade increment for reading means the same as
an increment of one month in arithmetic, i.e. the rate of growth from
subject to subject is constant. All of these assumptions have been
proved to be totally incorrect.

About the time this program was getting under way, this writer was
asked to act as a consultant to the State Department of Education and,
specifically, to Title I within the State of New Hampshire to help
implement a program that would be effective. The first step that was
taken along these lines was to persuade the Department of Education to
re-test in the spring a random sample of pupils from the entire state
as a control group so that the gains made by the Title I pupils in the
state could be compared with gains made by this random, and thus
representative, sample for the state.

This was an enormous step forward, especially in this context when
there were no spring norms and fall norms available for the Stanford
Battery. Stanford was typically standardized in the spring, as
Metropolitan 1/ is typically standardized in the fall, and
extrapolations over the period of time from school year to school year
did not provide a satisfactory method of determining the amount of
growth to be expected over a seven-month period - especially in the
case of a test having a variety of subtests.

The identity of the children who were to comprise the random sample
was determined by use of a random technique employing the IBM-360
Model 50 computer at the University of New Hampshire. The work was
done under the direction of the Bureau of Educational Research and
Testing Services. 2/

It was specified that a total of 1,000 children out of about 10,000+
were to be identified to constitute this random sample. These children
were further identified by school and a request was sent to the
administrators of the school districts in which these children resided
to have them re-tested at the same time the Title I children were
re-tested in the spring.
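The report does not say which random technique was programmed. As a
sketch only, one common choice is simple random sampling without
replacement from the list of pupil identifiers. Everything in the
example below (the identifier format, the seed, the function name) is
an assumption made for illustration and is not the procedure actually
run at the University of New Hampshire.

```python
import random


def draw_random_sample(pupil_ids, sample_size, seed=None):
    """Draw a simple random sample without replacement.

    A fixed seed makes the draw repeatable, so the same pupils can be
    listed again when the spring re-test is scheduled.
    """
    rng = random.Random(seed)
    return rng.sample(list(pupil_ids), sample_size)


# Hypothetical roster of 10,000 grade 4 pupils with made-up identifiers.
population = [f"P{n:05d}" for n in range(1, 10001)]

sample = draw_random_sample(population, 1000, seed=1970)
print(len(sample), sample[:3])
```

Each pupil has the same chance of selection, which is what allows the
sample to stand in for the state as a whole when gains are compared.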



1/ Metropolitan '70 has both fall and spring standardization programs.

2/ At this point, the writer would like to express his appreciation to
Mr. Richard Clukay for his great help in preparing programs, debugging
them, and implementing the analysis of the data on the computer - as
indicated in the Title I Report as mentioned above.



Apparently not all of the schools chose to comply with this request,
so that the total number of children actually tested for the random
sample was a little more than 600 - as compared to the specified
1,000. The experimental population, if one wants to call it that,
consisted of 426 children in the Title I program, concerning which
much more will be said later in a separate section.

Evaluating the Random Sample

The scores for the fall testing program made by the children actually
drawn for the random sample for whom results were available both fall
and spring were distributed on a number of variables and their results
were compared with the total state sample. The outcome of these
comparisons is given in detail in the report to the State Department
of Education entitled "A Description and Evaluation of the Statewide
Testing Program in New Hampshire in 1968-69 and 1969-70 under the
Sponsorship of Title I and the Significance of the Data Obtained for
Evaluation with this Activity." This report was completed in July 1971
under a contract of March 13, 1971.



It would be too space-consuming and redundant to reproduce in its
entirety the comparison of the random sample available to us for
analysis with the total group for the state. However, it can be said
that every type of evidence of comparability obtained seemed
convincing that the two populations were sufficiently interchangeable
for our purposes.


In Chart I-2 the random sample IQ distribution is reproduced on a
Normal Percentile Chart superimposed on the IQ distribution for the
state as a whole. It is obvious by looking at the two distributions
that they are very nearly the same. Statistical tests might have been
suitable in this situation but were not thought to be necessary
because all that was really needed was a population tested both fall
and spring that was reasonably representative of the state as a whole.

Random Issues Relative to Testing

Even if Stanford Form X were a criterion reference test (which it was
not) in which it was expected that most or nearly all of the children
would answer most of the questions - or nearly all of them - right,
this would still leave unanswered the extent to which the ablest
children are not learning as much as they are capable of learning, or
whether they are learning the right kinds of things for them,
considering that they are atypical with respect to the grade as a
whole. The same generalization is certainly true of the Title I
children who were, by selection and definition, also an atypical
group.

It is nice to know that they (and most of the other children) are
learning or have learned, at one time or another, everything
considered to be relevant for inclusion in a criterion reference test,
but it does not tell the teacher what to do next for the exceptional
child. How much farther could the exceptional child go on from there
if his opportunities were less limited as he lock-steps his way
through a prescribed minimum foundations curriculum? Contrariwise, do
Title I children master the learning essential for them to master
before going on to new material?

Sometimes it is possible to use a well-standardized general
achievement examination as the basic core for a testing program which
is supplemented by locally-made test items to fill in the gaps
covering learnings thought to be essential by those concerned with
curriculum matters at the local level. Thus, the achievement test can
be interpreted in terms of the norms provided for it, and an item
analysis can be done to give information concerning performance on
individual items.

To the score on the published test is added a score on some
supplementary test intended to round out the inadequacies of the
standardized survey-type instrument, and the total score plus item
analysis of all items is taken into account in determining the
adequacy of the instruction in light of the needs of individual pupils
within the school, class, or instructional group.

The most significant use of criterion reference testing should be
found where teaching most desperately needs to be individualized;
namely, with those children who show a disparity in their performance
from what is typical of their peer group. In these instances, it is
then possible to go on with a higher level of instruction for some or
to slow up the pace until mastery reaches the level established for
others.

Any idea, however, that children are universally going to master the
material typically found in a textbook or recommended for teaching at
a particular grade is unrealistic. Sometimes it takes as much as two
or three grades beyond the grade level at which a topic is introduced
before it is really adequately learned so that it becomes a part of
the tool kit for the child in attempting to learn or to attack
problems at a higher curricular level.

Many, probably most, communities or school systems take their
responsibility for instructing teachers in the intricacies of test
item writing very lightly, feeling that this is a task that should
have been accomplished during the course of their undergraduate
training.

The fact of the matter is that training in the development of tests is
one of the most neglected areas in teacher education, and few new
teachers enter the classroom prepared to undertake the simplest kind
of test construction that could be considered to be scientifically
valid - to say nothing of interpreting the results of tests
(especially ones supplemented by local items) in a manner that is
consistent with the best statistical and methodological practices in
educational and psychological measurement.



Advocates of mastery teaching, which is the natural outgrowth of the
criterion reference approach to educational evaluation by test,
recommend keeping a child working on a particular knowledge or skill,
or coming back to it very frequently, until he can demonstrate what
they consider to be a satisfactory mastery of that particular item or
that skill. The writer has considerable sympathy with this point of
view if, and when, it is possible to establish an hierarchy and to
demonstrate that it is essential that persons know certain material at
a particular grade or development level before they should go on to
another still higher level of instruction.

In the 1890's it was common for schools to use textbooks which were
graded, not in the sense of being assigned to a particular grade (as
this term is used in this country now) but in the sense of being
sequential, and a child was required to stay in a particular "book"
until he had mastered the content of that book before he was allowed
to proceed to the next one. In those days schools were small, often
being of one or two rooms, and the teacher could handle this type of
situation because the older children became the teachers of the
younger and many children learned in school by listening to their
older brothers and sisters recite as called upon by the teacher.



With the modernization of education and the development of a grade
system, this practice was, of course, abandoned. Now we find ourselves
coming back to a kind of structuring of the curriculum and of
instruction that closely corresponds to the olden days or, in more
modern terms, closely corresponds to programmed instruction (or the
procedure used in programmed instruction using its test, teach, test
psychology). In programmed instruction, a child is rarely supposed to
proceed to the next higher level until he has successfully answered
test items supposedly indicating mastery of the knowledge or skill
currently being taught, and the hierarchy, whether or not it does in
fact exist in truth, is there because it is imposed by the person who
is responsible for constructing the programmed text.

It seems to make a great deal of sense, however, for us to re-examine
the entire curriculum and, insofar as possible, to break it down into
behavioral objectives or performance objectives which can be shown to
follow some hierarchy (even an artificial one). Knowledges and skills
which are peripheral should be treated as such, allowing the child in
a class to learn everything he can learn about the world in which he
lives - whether it is formally considered a part of the curriculum for
which the child should be responsible at that point in his
development, or whether it simply is a way of broadening his
understanding of his world.

Individualization of the curriculum in the American public schools is
probably the trend of the future, but if this is the case, one must
face up to the fact that it may mean an enormous increase in the costs
of public education, either for instructional personnel or for
equipment which will substitute for instructional personnel - such as
audio-visual aids, computers and the like.

Administrative changes which allow pupils to move through the
curriculum at their normal pace can contribute to the
individualization of education, but in the long run someone must make
the decision as to what the imperatives are and see to it that they
are provided for.

Thus, if an instrument was constructed for the purpose, one could
measure the extent of the gain subtest by subtest over a short period.
Comparisons could be made for any other subgroup breakdown, such as
boys versus girls, over the seven-month period separately for each
subtest, and one could evaluate this gain in raw score points or
standard score points, but certainly not in terms of grade equivalents
or percentile rank.
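A comparison of this kind can be sketched in a few lines. The example
below takes the gain for a subtest to be the spring median raw score
minus the fall median raw score, optionally restricted to one
subgroup. That definition, the record layout, and all of the scores
are assumptions made for illustration; none of the figures come from
the New Hampshire data.

```python
from statistics import median


def median_raw_gain(records, subtest, group=None):
    """Spring median minus fall median, in raw score points.

    `records` is a list of dicts with hypothetical keys "group",
    "subtest", "fall", and "spring". Pass group=None to pool all pupils.
    """
    rows = [
        r for r in records
        if r["subtest"] == subtest and (group is None or r["group"] == group)
    ]
    if not rows:
        raise ValueError("no records for this subtest and group")
    fall_median = median(r["fall"] for r in rows)
    spring_median = median(r["spring"] for r in rows)
    return spring_median - fall_median


# Invented scores for five pupils on one subtest.
records = [
    {"group": "boys", "subtest": "Word Meaning", "fall": 10, "spring": 16},
    {"group": "boys", "subtest": "Word Meaning", "fall": 14, "spring": 21},
    {"group": "boys", "subtest": "Word Meaning", "fall": 20, "spring": 25},
    {"group": "girls", "subtest": "Word Meaning", "fall": 12, "spring": 20},
    {"group": "girls", "subtest": "Word Meaning", "fall": 18, "spring": 24},
]

for group in ("boys", "girls", None):
    print(group, median_raw_gain(records, "Word Meaning", group))
```

Working in raw score points keeps the comparison within a single
subtest, where the unit means the same thing in the fall and in the
spring.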

Organization of the Study

The organization of this report is such that the results of the
administration of Stanford Intermediate I Battery: Form X in the fall
and spring for the random sample will be presented first in several
ways, allowing the reader to draw whatever conclusions he wishes
concerning present success or failure from these comparisons. Beyond
our own commentary, we will present alternatives as to the
configuration of any battery to be used in such a manner.

Methods of Data Analysis

At this point, it is quite essential to call attention to some
significant and unusual aspects of the data finally available for
analysis.

It is not the common practice to repeat the same form of a test over a
period of time to measure gain because of the possible effects of
remembering the answer given on the first administration. Factors of
finance and logistics, I think, were uppermost in the minds of the
State Department of Education when the decision to use the same test
was made. It was decided to leave the test battery in the hands of the
local school administrators so that only answer sheets would need to
be distributed in the spring - and thus the distribution problem and
the matter of determining the real equivalence of two forms purported
to be comparable would be bypassed.

Regardless of the general merits of this procedure, with which the
writer was first inclined to totally disagree, it did afford an
opportunity to make a type of analysis that could not otherwise have
been made; namely, the comparison of the response made by each child
to each item on each test, fall and spring, so that it was possible to
study the consistency of response from fall to spring in a very
detailed manner.
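To make the idea of item-by-item consistency concrete, the sketch
below tallies, for one child, how each item's fall response compares
with the spring response. It assumes each response is stored as the
number of the choice marked, with 0 standing for an omitted item; that
coding, the category names, and the sample data are hypothetical and
are not taken from the state's tape layout.

```python
from collections import Counter

OMIT = 0  # assumed code for an item the child left blank


def classify_responses(fall, spring, key):
    """Tally fall-to-spring response patterns for one child.

    `fall`, `spring`, and `key` are equal-length lists of choice numbers.
    """
    if not (len(fall) == len(spring) == len(key)):
        raise ValueError("response lists and key must be the same length")
    tally = Counter()
    for f, s, k in zip(fall, spring, key):
        if f == s:
            if f == OMIT:
                tally["same_omit"] += 1
            elif f == k:
                tally["same_right"] += 1
            else:
                tally["same_wrong"] += 1
        elif s == k:
            tally["changed_to_right"] += 1
        elif f == k:
            tally["changed_from_right"] += 1
        else:
            tally["changed_other"] += 1
    return tally


key = [1, 2, 3, 4, 1, 2]
fall = [1, 3, 0, 4, 2, 0]
spring = [1, 3, 3, 2, 4, 0]

tally = classify_responses(fall, spring, key)
print(dict(tally))
```

A child who repeats the same wrong choice on both occasions behaves
differently from one who moves between wrong choices, and keeping the
marked choice rather than only right or wrong is what makes that
distinction visible.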

However, before presenting these data it would be well to take a look
at the amount of the gains that were found in terms of raw scores for
the fall as against the spring tests separately by subtest. For this
purpose it was, therefore, only necessary to compare the random sample
fall testing results with the spring results for the same group of
pupils. This comparison, in other words, did not involve Title I
children because what was being attempted at this point was merely to
determine what was generally a normal or typical gain so as to provide
a basis for evaluating subsequently what Title I children did under
the same conditions.

In order to accomplish what we need to know for this report, we only
have to reproduce a portion of the data appearing in the original
statewide report, previously referred to, which provided for the
comparison of raw scores for fall and spring together with the raw
score gains. This table also gave the amount of gain in terms of grade
equivalents - to satisfy the "believers" in this approach, especially
the U.S. Office of Education!
Since this study is restricted to grade [...] in the present project [...] therefore, reproduced [...] the previous report. 1/ [...] remember that the gains [...] random sample are over a [...]. These gains are [...] the 75th, 50th, and 25th percentile ranks, but the main emphasis is on the median, or 50th, percentile rank.

An examination of the raw score gains, considering the medians only for the moment, shows that they amount to perhaps a point per month of instructional time for Word Meaning, Paragraph Meaning, and Arithmetic Computation, but drop substantially to four raw score points in Arithmetic Concepts and Applications. This drop is due, in substantial measure, to the smaller number of items in the last two arithmetic tests.

In terms of grade equivalents, the gains also appear to be about as one would expect them to be, except that the gains in Word Meaning and Paragraph Meaning exceed what one would expect over a seven-month period, i.e. a gain of about one point per month of school instruction.

Perhaps the most distressing fact coming out of this comparison is the relatively small difference between children at the 75th percentile rank versus the 50th. Even the difference between the 25th percentile rank and the 75th is small. In Word Meaning, for example, there is no [...] raw score gain, i.e. both of the extreme percentile ranks have comparable [...] raw scores [...]

The evidence just discussed filled this [...]

New Data Available for Analysis

At this point, we must shift our attention to the additional data wrung out of the testing program by use of the answer sheets [...] using available code numbers, with the result that we came up with the numbers of cases previously mentioned as a basis for comparison. These are cases for which we do have complete data, both for those included in the random sample and for Title I children.

Just what the loss of cases due to incompleteness of individual pupil data did to the analysis is, of course, impossible to tell, but we have some data bearing on the tested random sample versus the population analyzed, which seem to establish rather firmly the fact that the sample drawn randomly and actually tested gives a remarkably close representation of the performance of the state as a whole.

With the data on IBM tape available for [...] was open to us - many of [...] completed and are reported [...] study, but [...] further analysis.

[...] item analysis [...] gave the [...] for each of [...] in the [...] by the actual response number marked by the child - not in terms of rights, wrongs, and omits. A key was included as an extra line on [...] listings after every fifth pupil's fall [...] the spring item analysis data. The computer [...] constituting Figure 1-1 shows a typical page from the random sample listing. Similar pages were available for the [...]

1/ A Description and Evaluation [...] Statewide Testing Program [...] New Hampshire in 1969 and 1970 [...] Sponsored [...] of Title I [...] Significance of the Data Obtained for Evaluation [...] With This Activity. Prepared by the Test Service and [...]
The page from the random sample was selected to be as representative as possible for the whole sample and it reveals some surprising information concerning test-taking behavior, especially as regards guessing tendencies. Some of it is so subtle that it defies analysis in terms of the usual types of statistical summarization but depends, indeed, on a critical visual examination of the data, i.e. what is sometimes called "eyeballing." This relates in particular to the pattern of scores that subtly differentiates the guessing child from the child who rarely guesses. This is the changing pattern of marking from a succession of rights at the beginning to a pattern typical of a "guesser," or a chance pattern.
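The shift described here, from a succession of rights to chance-level marking, can be illustrated with a rough sketch. The Python below is not the study's procedure, which relies on visual inspection. It assumes a pupil's marks are coded 1 for right and 0 for wrong in item order; the function name and the split rule (the point with the largest drop in proportion right) are hypothetical choices made only for illustration.

```python
def find_pattern_shift(marks):
    """Locate the item after which a pupil's marking looks most different.

    marks: list of 1 (right) / 0 (wrong) in the order the items appear.
    Returns (k, tail_rate): the first k items form the "head", and
    tail_rate is the proportion right among the remaining items.
    Hypothetical heuristic: choose the k that maximizes
    (proportion right before k) - (proportion right after k).
    """
    if len(marks) < 2:
        raise ValueError("need at least two items to split")
    best_k, best_gap = 1, float("-inf")
    for k in range(1, len(marks)):
        head_rate = sum(marks[:k]) / k
        tail_rate = sum(marks[k:]) / (len(marks) - k)
        gap = head_rate - tail_rate
        if gap > best_gap:  # strict '>' keeps the earliest best split
            best_k, best_gap = k, gap
    tail_rate = sum(marks[best_k:]) / (len(marks) - best_k)
    return best_k, tail_rate


# Illustrative pupil: eight rights in a row, then scattered marks.
pupil = [1] * 8 + [0, 1, 0, 0, 0, 1, 0, 0]
k, tail = find_pattern_shift(pupil)
print(k, tail)
```

On a four-choice test the chance rate is 1/4, so a tail rate near .25 after the split would be consistent with guessing, although a short tail gives weak evidence either way.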
[...] Corresponding to Selected Percentile Ranks

RANDOM SAMPLE

                            %ile       Raw Scores           Grade Equivalents
Test               Items    Rank    Fall  Spring  Gain    Fall  Spring  Gain   Dev.*

Word Meaning         38      75      21     27      6     4.9    5.9    1.0    +.3
                             50      16     22      6     3.9    5.1    1.2    +.5
                             25      10     16      6     3.3    4.1     .8    +.1

Paragraph Mean.      60      75      30     40     10     4.6    5.9    1.3    +.6
                             50      23     31      8     3.8    4.7     .9    +.2
                             25      17     24      7     3.0    3.9     .9    +.2

Arith. Comp.         39      75     [...]   23    [...]   4.0    5.2    1.2    +.5
                             50     [...]   18      5     3.6    4.5     .9    +.2
                             25     [...]   13    [...]   3.1    3.8     .7     .0

Arith. Concepts      32      75      16     20      4     4.8    5.5     .7     .0
                             50      12     16      4     4.1    4.8     .7     .0
                             25       9     11      2     3.3    3.9     .6    -.1

Arith. Applied       33      75      16     21      5     4.6    5.5     .9    +.2
                             50      12     16      4     4.0    4.6     .6    -.1
                             25       9     11      2     3.6    3.9     .3    -.4

* Represents the deviation from the expected gain of .7 of a calendar year, often inaccurately designated as 7 months of a school year.
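The "Dev." column follows directly from the footnote: it is the grade-equivalent gain less the expected gain of .7 of a year. A minimal sketch of that arithmetic is given below; the function name is hypothetical, and the figures used are the Word Meaning median values as read from the table (fall 3.9, spring 5.1).

```python
EXPECTED_GAIN = 0.7  # expected fall-to-spring gain, from the table footnote


def gain_and_deviation(fall_ge, spring_ge, expected=EXPECTED_GAIN):
    """Return (gain, deviation from expected gain) in grade-equivalent units.

    Values are rounded to one decimal, matching the precision of the table.
    """
    gain = round(spring_ge - fall_ge, 1)
    deviation = round(gain - expected, 1)
    return gain, deviation


# Word Meaning, 50th percentile rank: fall 3.9, spring 5.1
print(gain_and_deviation(3.9, 5.1))
```

A positive deviation means the group gained more than the seven-tenths of a year that elapsed between testings; a negative one means it gained less.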
Test Scores Related to Information Theory

In information theory, there is a "sender" of a message, a "receiver" of this message, and other interfering "noises," or "static," which keep the message from being clearly or completely understood. The extent to which this "static" is present is a measure of the extent the communication is impeded; or, if it is absent, is a suggestion that the message communication is entirely perfect (a rare phenomenon).

When applied to tests, information theory postulates that the basic purpose of a test is to provide the medium by which a child can communicate to his teacher (or to others) what he truly knows and what he does not know. In this situation it would seem to be evident that anything that even remotely smacks of guessing must be considered in the nature of static - because it clouds the validity of the message the teacher is receiving from the pupil. In a nutshell, this constitutes the most important phase of this study when it is taken overall; namely, the identification of ways in which the "message" may be conveyed from the pupil to the teacher with the least static.

Obviously this may involve great changes, not only in the nature of the tests being used but also in the conditions under which they are administered. It certainly calls for a climate of confidence within the classroom which will allow the child to feel free to respond to an item or not to respond to it, hopefully, in the latter case, by use of the unambiguous "Don't Know" space - which should be provided to permit him to indicate to the teacher that he specifically does not know the answer to the question being asked.

A summarization of this "Don't Know" information may, indeed, be the most important data coming to the teacher as a result of the test item analysis.
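As an illustration of what such a summarization might look like for a single item, the sketch below tallies one item's responses into rights, wrongs, omits, and "Don't Know" marks. The coding is assumed for the example only: responses are choice letters, "DK" stands for the Don't Know space, and None stands for an omitted item. The function name is hypothetical.

```python
def summarize_item(responses, key):
    """Tally one item's responses across pupils.

    responses: one entry per pupil; a choice letter, "DK", or None (omit).
    key: the keyed (correct) choice for the item.
    """
    counts = {"right": 0, "wrong": 0, "dont_know": 0, "omit": 0}
    for response in responses:
        if response is None:
            counts["omit"] += 1
        elif response == "DK":
            counts["dont_know"] += 1
        elif response == key:
            counts["right"] += 1
        else:
            counts["wrong"] += 1
    return counts


# Six pupils answering an item keyed "a"
print(summarize_item(["a", "b", "DK", None, "a", "DK"], key="a"))
```

Keeping "Don't Know" separate from both wrongs and omits is the point of the tally: it is the count the teacher would consult to see which items the class openly reports not knowing.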

What this writer has to say subsequently concerning the analysis of these data is clearly and purposefully reflective of this basic point of view; namely, that testing does constitute in the educational field a kind of communication. The significance of our success or failure is that the final validity of the test rests on our ability to do this. Even more important is the confidence the teacher can have that the student's response to a particular item is truly indicative of his grasp of the material being tested, both for the individual and in the needed groupings of this information. The child's true position relative to his peer group is also at stake.



The need for such normative interpretation seems perfectly obvious. However, before we get to the point of interpreting our scores in terms of norms we should emphasize once more that the building block of a test is the test item; that the test item comes directly out of the context of the generally available instructional material; that the test-maker does not choose what items he will retain or leave out of the experimental or tryout test, but rather chooses for the final test items which are known by enough children to make it worthwhile to include them and eliminates some items which are wholly mastered - probably because they are below grade level for the time of year and the grade level at which the test is used.

In this study, the Stanford Intermediate I Test apparently gave very good results at grade 4 in terms of conventional measurement criteria, considering the fact that the subtests are unduly short - having fewer than forty items in every test except Paragraph Meaning. The distributions of scores are fairly symmetrical and show the characteristics that a measurement person normally looks for when he is attempting to make use of a group test of this sort for measurement purposes, i.e. reasonable statistical reliability and a generally satisfactory "status report" on the ablest and least able pupils as judged independently of the teacher's evaluation. 1/

For curriculum purposes, however, the test was not long enough for the ablest pupils and the range of difficulty was too steep for the least able - as will come out in the course of our subsequent investigations.

More About Criterion Reference Testing

One of the really active movements in the field of testing in recent years has been the development of what is known as the "criterion reference test," which presumably is a test which reveals what the child has learned at the particular grade level at which he is functioning or earlier. The theory behind this approach is that the criterion reference test, in contrast to the general achievement test, will reflect skills and knowledges of such paramount value that everyone, or nearly everyone, in the group should be expected to answer the questions correctly, i.e. to demonstrate mastery of these stepping-stone atoms of in-school instruction.

1/ SAT national norms turned out to be on the hard side for no discernible reason, but the preferred use of local norms sidestepped this shortcoming.
This is not the time and place to discuss criterion reference testing in detail, except to emphasize the fact that the approach taken in this study is somewhat similar to the approach that would be taken in a criterion reference test in that more attention has been paid to the performance of individual pupils on single items than is normally the case in analyzing the results for a general achievement battery. Thus, the question of item validity is a paramount issue.

Many people have raised the question as to whether general achievement batteries may be used in the place of criterion reference tests. This is a difficult question to answer because it depends upon the extent to which the general achievement test used is properly graded for the group taking it. Some tests, for example, may be too easy for the least able or even typical fourth grade and definitely too easy for an exceptionally able, i.e. above average, group - not providing enough difficult material to reveal the top ability level of the "most able" children.

Who should see that these imperatives are appropriately taken care of and not lost in the welter of activities which make up the ongoing daily program of the public schools is a real question. For example, a child who has a reading level of only average grade 3 pupils can hardly be expected to cope with the curriculum of the middle school or the junior high school.

Who is to monitor a pupil's progress and on the basis of what objective data?

There are other implications growing out of the author's contention that the test situation is essentially one of communication between the child and the teacher and vice versa. If this is true, tests and test results should play an important role. The most obvious implication of this philosophy is that the child should, by all means, know whether he answered "the question" correctly or whether he missed it and, if he did miss, that he shall be given an opportunity to review or have additional instruction, as needed, in order to learn the knowledge or skill and get recognition for so doing - unless the item in question is dismissed as one that is peripheral to the course curriculum or the imperatives of the curriculum.

Similarly, it implies that the teacher must take the trouble of finding out who knows what, i.e. group analysis, and provide a program for those children who have mastered a particular item or skill or a group of items or related skills, and a differentiated one for those who are in need of additional learning material, as well as the encouragement for others to go on to learn at the rate commensurate with their abilities.

Teachers will say that to take such supplementary action for children who are atypical involves an effort on their part that is unreasonable to expect in light of today's demands on people outside the realm of their way of making a living. Remember that about two-thirds of the children within a particular classroom that is heterogeneously grouped will be quite similar in their learning potential. There is no answer to this objection on the part of the teacher except to provide the help needed, as indicated above, in a suitable manner. In some cases, there certainly is no way to avoid the necessity of the teacher aide, or some human being to be there to meet a crisis or provide a learning situation as needed.

There is no doubt that the ideal situation in teaching is, as is so often said, "Mark Hopkins on one end of the log and the pupil on the other." Such a situation necessarily implies the desired debate or sharing of knowledge between pupil and teacher so that not only is a fact learned or a skill mastered but the child knows why the given fact is true and why other possible answers are not true if there is a choice.

The able child in such a learning situation is then encouraged to move ahead as rapidly as he can, while the slower child is dealt with patiently and is given the requisite practice and drill in learning those prerequisite knowledges and skills while not being deprived of his share in the fun that is an essential part of going to school.

Treating test data as if it constituted the transmission of a message from pupil to teacher and vice versa has implications with regard to the climate of confidence in the classroom. It must lead, inevitably, to a thoroughgoing consideration of restructuring American education in the manner indicated above, to provide for these multiple levels of accomplishment at each major mileage marker which designates the place where the child should be in terms of his mastery of the established hierarchy of knowledges and skills up to his level of learning ability.

Differences in effective learning rate will not go away regardless of fervent arguments that they are environmental, not inherited. The fact that must be dealt with is that they are there - real, measurable, and constantly influencing learning.

Actually, the absolutely basic knowledges and skills in the lower grades center around the development of reading skills, the development of vocabulary, and the development of ability to compute to the point where number combinations and the like are automatic. It is not the purpose of this paper to develop a scheme for accomplishing all this in an administrative sense, but simply to provide evidence arising from the present testing program, described above, that this is not being done if the way children answer test questions may be taken as evidence of this failure of their knowledge.

Certainly, this study implies that guessing in a multiple choice item situation has no place in testing and everything possible must be done to insure that the child is encouraged to give an honest response which, more often than not for some, may be a "Don't Know." 1/

The problem of time limits in test-taking also is a situation that must be dealt with more imaginatively. Generally the practice is, in the construction of published achievement tests, to arrive at a time limit such that all but one or two pupils in the class will have an opportunity to do all that they can do in the time allowed. This assumes that the test items are arranged in order of difficulty and that this order is more or less stable from one population to another, which is generally but not always the case.

Ways are suggested in this study of compensating for less than adequate time limits by estimating a total score on a time limited test using essential item analysis information available, according to the procedures to be recommended, and ranking the child as to his performance in school on the basis of this estimate rather than his actual score. Measures of school learning ability similarly should be developed which do not depend solely upon how many items a child can mark, correctly or not, within a given length of time. The prediction procedure to be suggested in the following pages is quite as applicable to such learning ability measures as it is to achievement tests.
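The recommended procedure itself is taken up in later pages and is not reproduced here. Purely to make the idea concrete, the sketch below shows one hypothetical way an estimate could be formed from item analysis information. It assumes each item has a known proportion passing in the group (its p-value), that the pupil reached only the first items in the booklet, and that his rights on those items stand in the same ratio to the group's expected rights as they would on the whole test. None of these assumptions, nor the function name, comes from the study.

```python
def estimate_total_score(rights_on_reached, p_values, n_reached):
    """Ratio estimate of a total score for a pupil who ran out of time.

    rights_on_reached: number right among the items the pupil reached.
    p_values: proportion of the group passing each item, in booklet order.
    n_reached: how many items (from the start) the pupil reached.
    Hypothetical rule: scale the pupil's rights by
    (expected rights on all items) / (expected rights on reached items).
    """
    if not 0 < n_reached <= len(p_values):
        raise ValueError("n_reached must be between 1 and the number of items")
    expected_reached = sum(p_values[:n_reached])
    if expected_reached == 0:
        raise ValueError("reached items have zero expected rights")
    expected_all = sum(p_values)
    estimate = rights_on_reached * expected_all / expected_reached
    return min(estimate, float(len(p_values)))  # cannot exceed the test length


# Eight items ordered from easy to hard; the pupil reached five and got four right.
p = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(round(estimate_total_score(4, p, 5), 2))
```

Because the unreached items are the harder ones, the estimate credits the pupil with only about one additional point for the three items he never saw, rather than projecting his 4-in-5 success rate across the whole test.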



1/ In the 1958 edition of the Metropolitan Achievement Test, the writer introduced for the first time (to his knowledge) the "Don't Know" space as an option. It was not to be scored but used as a way of keeping a child's test response intellectually honest.

It is best illustrated by its application in the Arithmetic Computation Test. In the handscoring edition of this test, the work was done in the booklet (work sample type) and the child transferred his answer to the margin of the sheet - where it was scored with a strip key but with the teacher looking back at the work done by the child in the process of scoring or subsequently.

The Spelling and Language Tests in this battery, for which the author was also responsible, made similar use of the "Don't Know" space. The basic principle involved was the same; namely, to provide a way for the disadvantaged or unknowledgeable child to escape the trap of having to answer randomly by marking the "DK" space as preferable to sitting for a substantial period of time doing nothing.

In the machine scoring edition of the test, the Directions for the Computation Test specify that the child shall actually do out the work, as before. The publishers offered to the user an "Arithmetic Worksheet" or optionally suggested that scratch paper could be used. The Directions for Administering: Arithmetic Computation (Machine Scored edition) say:

"Work each example on the paper provided. As soon as you have worked an example, find the three answers given for the example in the right hand column of the test booklet. Then, on the separate answer sheet, fill in the space under the letter of the answer which agrees with yours. If you do not find your answer in the test booklet, fill in the space under NG (for not given) on the answer sheet. If you do not know how to work the example, fill in the space under DK (for don't know)."

The child's actual computation was to be left with the teacher with his work intact, while the answer sheet was sent for machine processing.

Note that in the machine scored edition three possible answers are given and, in addition to these three answers, an "NG" (Not Given) response is provided as well as the "DK" or "Don't Know" response. The NG response was a scored response, but the number of items so keyed in the test was minimal.

In this writer's opinion, the scratch paper was not a viable alternative because of the time required to copy the computation problem; but expediency won out.

If one analyzes this procedure closely, it is seen that, in essence, this is a job or work sample, and the marking of the separate answer sheet is merely a clerical task transferred to the child in addition to the work he has to do in making his computation.

The 1970 edition maintains to some extent the characteristics of the 1958 edition, but the separate consumable Arithmetic Worksheet is not available.

The joker is that both the 1958 and 1970 editions were standardized using expendable booklets. Children were permitted to do the work in the booklet without having to copy off the examples, and the norms for Metropolitan were based upon this assumption.
SOME PERTINENT FACTS IN REGARD TO
THE STANFORD ACHIEVEMENT TEST:
INTERMEDIATE I BATTERY: FORM X

It is impossible, in the short amount of space available in this report, to cover all of the essential information concerning the Stanford Achievement Test, 1964 revision. Few people who are acquainted with achievement testing, especially of the battery type, can be found who are not familiar with the Stanford Achievement Test Series in general. It was the very first such general achievement test published and its publication date of approximately 1923 put it well ahead of its competition in regard to such battery-type tests.

The next major revision was in 1940, followed by another in 1953 and, finally, by another in 1964. At the time this study was undertaken, the 1964 battery was the current battery in use. It has subsequently been revised and the new forms became available in the fall of 1973.

However, the little attention that we can pay to the characteristics of the battery in this article must be confined to the 1964 edition, Intermediate I: Form X, and to the tests in Word Meaning, Paragraph Meaning, Arithmetic Computation, Arithmetic Concepts and Arithmetic Applications only.

This means that no data are reported here concerning the Spelling Test, the Language Test, the Social Studies Test, or the Science Test - although data of a similar nature to that basically used in this study are available for these other tests in the fall. Considerations of time and expense precluded the use of all of the tests in the spring and, of all of the tests in the battery, the ones that seemed to be most relevant and of most interest to the user were the tests in the general areas of reading and word meaning, on the one hand, and mathematics, on the other. (See Table 1-2.)

Reading continues to be the outstanding concern of school people so far as school curriculum is concerned, especially in Title I and similar programs, but not far behind is concern for arithmetic achievement. We have gone through many curriculum changes during the period of the last decade. The traumatic experience of a major revision in the mathematics curriculum (from conventional to the so-called "modern" math), especially in the middle grades, is over and currently the trend is back toward a more conventional approach.

Form X of the Intermediate I Battery is intended for grade 4.0 through grade 5.5; in other words, all of the fourth grade and one-half of the fifth grade. All of the items in this test, therefore, should be basically applicable to this grade range - with the possible exception that some very easy items may have been included for the sake of giving "bottom" to the test for the slow learners in grade 4, and some difficult items may have been included in order to give the test "top" for children tested up to the middle of the fifth grade.

The Intermediate I Battery used in this study thus is optimally placed for the designated grade levels (4.0-5.5) and, therefore, a very large proportion of the items should be found within a typical curriculum at grade 4. Just how many of these items are typical of the curriculum in New Hampshire can be told by comparing the items in the test booklet with courses of study available for the state. Variations in the curriculum from school district to school district also are of major importance.

It should be noted here that the representative sample used for this analysis, consisting of some 560+ students, was chosen randomly from all parts of the state and, therefore, any validation that tries to relate the tests to the curriculum in effect in Community A versus Community B is doomed to failure. This may not be a serious matter since the determination of the item content for this battery was done in terms of examinations of textbooks and related materials that were most generally used at the particular grade levels mentioned (4.0-5.5).

In addition to a consideration of the published test, we should realize that this test was preceded by an experimental edition used for item analysis on a large and presumably representative population, and it was only on the basis of the item difficulty and item discrimination values so obtained that the final selection of items was made.

A statement in the Stanford Technical Supplement, which is available for the series, indicates that the intent was to maximize the coverage at grade level by including items with difficulties corresponding to this proportion:

Item            Percent of Items
Difficulty      For Grade 4.4

80-89                 10
70-79                 10
60-69                 20
50-59                 20
40-49                 20
30-39                 10
20-29                 10
Table 1-2
Composition of Stanford Achievement Test
Complete Battery - Intermediate I: Form X

Test                                   No. of    No. of     Time
No.    Test Name                       Items     Choices    Limits

 1     **Word Meaning                    38        4        10 min.
 2     **Paragraph Meaning               60        4        30  "
 3     Spelling                          50        4        15  "
 4     Word Study Skills                 61       [...]     20  "
 5     Language                         122       2-4       41  "
 6     **Arithmetic Computation          39        5*       35  "
 7     **Arithmetic Concepts             32        4        20  "
 8     **Arithmetic Applications         33        5*       30  "
 9     Social Studies                    49        4        35  "
10     Science                           56        4        25  "

 * Includes one NG (Not Given) space
** Included in the present study

Composition of Otis-Lennon Mental Ability Test
Elementary II Level: Form J

No. of                                             No. of     Time
Items     Test Description                         Choices    Limits

 80       A Measure of Verbal-Educational [...]     [...]     40 min.
By reference to this table it is possible to see that 60% of the items were supposed to fall generally in the range from 40% passing to 69% passing.
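The 60% figure can be checked by adding the three middle difficulty bands of the Technical Supplement proportions. The short sketch below does that arithmetic. Two of the seven percentages are illegible in the copy at hand and are taken here as 10 each, an assumption that is consistent with the seven bands summing to 100; the names used are illustrative only.

```python
# Intended percent of items in each difficulty band (percent passing).
# The 80-89 and 30-39 entries are read as 10, an assumption (see above).
TARGET_PERCENTS = {
    (80, 89): 10,
    (70, 79): 10,
    (60, 69): 20,
    (50, 59): 20,
    (40, 49): 20,
    (30, 39): 10,
    (20, 29): 10,
}


def percent_of_items_between(low, high, table=TARGET_PERCENTS):
    """Sum the percents for every band lying wholly within [low, high]."""
    return sum(pct for (lo, hi), pct in table.items() if lo >= low and hi <= high)


print(percent_of_items_between(40, 69))   # the three middle bands
print(sum(TARGET_PERCENTS.values()))      # all seven bands
```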

The experimental edition was prepared regardless of the difficulty characteristics of the original items prepared and was, therefore, in a sense a more valid test of the total curriculum for grade 4 than the final test as represented in the Intermediate I Battery: Form X.

Additional information concerning the validity of the test is to be found in the Technical Supplement.

The basic reference on validity appears on page 23 of the Technical Supplement and, quite properly, emphasizes the fact that the validity of the test must be determined essentially in terms of the local curriculum because of the variations in the curriculum from place to place. It also, however, points out that validity, in a general sense, is established by reporting the procedure for determining the content from which items were chosen for inclusion in the test; namely, the analysis of textbooks and related subjects. The specific content of each battery is further defined in Appendix B, which contains item content outlines for most of the subjects.

However, the following quite interesting sentence appears early in this Appendix:

"Furthermore, the Word Meaning, Paragraph Meaning, and Spelling Tests in the upper batteries are of such a nature that Content Outlines are not meaningful for them."

It is not quite clear from the Appendix, and certainly not from this sentence, why the content outlines are not meaningful. Is it that there is so little agreement as to content of reading materials with respect to vocabulary and type of material that this cannot be generalized? This seems unlikely.

The usual estimations of grade placement obtained by doing readability indices seem not to have been used for the Paragraph Meaning Test. There is no reference to any sources, such as the Rinsland list or other word lists, to show that the words used were categorized by grades in which they most commonly appear to justify the selection of words used in the Word Meaning Test.

This leaves the local community entirely dependent on its own evaluation of the content for Reading and Spelling - to agree or disagree that it is representative of the material being used as part of the instructional material in reading or in vocabulary development.

The content outlines for the arithmetic tests, on the other hand, are quite specific and very helpful indeed in determining what the content of each test is. When these content outlines are used in connection with the test itself to relate the test to the local curriculum, one cannot go far wrong in determining whether or not these tests measure the objectives of the local curriculum.

One might point out that at the time this test was used in this study in 1969-70 the arithmetic tests probably were more valid than they were at the time they were tried out - because the authors and publisher of the 1964 Stanford found themselves in a dilemma. Modern mathematics was just in the process of being introduced and, anticipating a test lifetime of approximately ten years, one had to anticipate that modern mathematics would become the dominant organizational influence in the math curriculum at the local level. Hence, it was essential to provide content that would be satisfying to those who had adopted the modern mathematics while at the same time preparing a test which would be functional in 1964 when the revised test was published.

One word of caution concerning all of the tests in the Stanford Battery, or any other achievement battery, is essential. We have referred to the fact that there was an experimental edition tried out on very substantial numbers of children, carefully selected to be representative of the country as a whole. This experimental edition, naturally, contained items which do not appear in the final edition. These items were eliminated essentially for two reasons:

1. The items proved to be too hard or too easy at the grade levels at which they were tried out;

2. The items proved to be faulty in their construction, i.e. they contained ambiguities or more than one correct answer and, because of these faults, had to be discarded.

The tests included in this study are listed below with the numbers of items in each test and a statement of the item type used, which also is of great importance in considering the validity of the instrument.

Word Meaning - 38 items:

Definition-type introductory statements followed by four choices to correctly satisfy the conditions of the definition.
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Paragraph Meaning - 60 items: 

» * 

Paragraphs, each of which contains two 
or three^ completions . The words needed to 
complete' the "numbered blank spaces are pro- 
vided in the form of f6ur choices , only one 
of which .satisfies the demands of the^ para- 
graph. This is essentially* then, a "four- > 
. choice comp letion-typfe test.- 

Arithmetic Computation - 39 items:

Four choices plus NG (Not Given). These items are representative of
various phases of arithmetic computation as clearly defined in the
Technical Supplement content analyses, but better appreciated by an
actual study of the booklets themselves. Note in the directions for
the test appearing in the booklet that each student is asked to work
each example first on scrap paper and then choose the correct
response or, if his answer is not there, to mark the NG space. In
light of data subsequently available, this is the important
consideration.

Arithmetic Concepts - 32 items:

Introductory sentences followed by four choices. This test includes
a variety of questions, some of which are hard to subsume under the
title "Concepts." For example, the translation of a Roman number to
an Arabic number is not really a measure of the extent to which the
child understands the Roman numeral and can translate it. It is more
a measure of the child's ability to relate one number in the Roman
form to its counterpart in the Arabic form. More satisfactory in
this respect is an item of the general type indicated in item #10,
where the sentence is: "Multiplication is most likely a series of -
e. additions, f. subtractions, g. divisions, h. estimations."

Arithmetic Applications - 33 items:

Four choices plus NG. This test is almost strictly analogous to the
more conventional arithmetic problem that has been used in the past.
Note that the student is supposed to work out his own answer on a
separate sheet of scratch paper before marking.



While the above information is helpful in defining the coverage of
the test in broad general terms, there is no substitute for a
careful examination of the test booklet for Form X and, hopefully,
the related material to be found in the Directions for Administering,
the Teachers' Guide for Interpretation and Use of Test Results and,
most importantly, the Technical Supplement, to which reference is
made frequently above.



A General Note on Item Difficulties

It has been clearly stated above that 
the standard procedure has been used for de- 
termining item content for each battery; 
namely, an analysis of textbooks and related 
material generally subsumed in the content 
outlines in categories - with a count of the 
number of items appearing in the test cor- 
responding to each of these categories. 

In actual truth, while this constitutes a very reasonable way of
making a test, it certainly does not constitute a statement of the
materials that one should expect students to master at the stated
grade level. In other words, all of the topics covered in all three
arithmetic tests of Intermediate I: Form X certainly are not going
to be introduced and mastered by all or even a majority of the
population of students to be found in grade 4 in our situation.

The Existence or Absence of an Hierarchy 



Criterion-reference testing, now popular in some quarters, generally
must assume an hierarchy in the area of presentation of material,
i.e., an order for the introduction of the materials so that
knowledges and skills essential for later development are taught and
mastered before these new skills are introduced.

Such an hierarchy becomes fairly evident in arithmetic for some or,
perhaps, the majority of the topics covered. For example, addition
and subtraction must be mastered first in the sense of the pupil's
having a nearly perfect retention of the 100 addition and
subtraction facts and also mastered in the sense that the
multiplication tables also are known to the point of near 100%
perfect recall as needed. However, as one departs from this
simplistic approach to arithmetic computation and gets into other
aspects of the content, the hierarchy is not as clear.

Addition of long columns of numbers not only calls on the child to
know his number combinations, i.e., the 100 addition facts, but also
to hold in mind constantly each new partial sum to which he must add
a subsequent number. If the child, for example, is adding ten
two-place numbers arranged in columnar form, he must remember eight
partial sums before he reaches the final sum of one column. He then
must carry everything over a single digit to the adjacent column and
proceed with the addition of this column in the same manner in order
to get the final sum desired.
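
The memory load described above can be made concrete with a small sketch. It is only an illustration, not anything taken from the test materials: the function names and the ten sample numbers are hypothetical, and the procedure assumes the child adds one digit at a time down each column and carries the tens of the ones-column total as an extra entry in the next column.

```python
def add_column(digits):
    """Add a column of single digits one at a time.

    Returns the final sum of the column and the list of intermediate
    (partial) sums that had to be held in mind before the last addition.
    """
    if not digits:
        raise ValueError("a column needs at least one digit")
    running = digits[0]
    partial_sums = []
    for d in digits[1:]:
        running += d
        partial_sums.append(running)
    # The last running total is the column sum, not a partial sum.
    final = partial_sums.pop() if partial_sums else running
    return final, partial_sums


def add_two_place_numbers(numbers):
    """Columnar addition of two-place numbers (assumed to lie in 10..99)."""
    ones = [n % 10 for n in numbers]
    tens = [n // 10 for n in numbers]
    ones_total, ones_partials = add_column(ones)
    # Everything over a single digit is carried to the adjacent column.
    carry, ones_digit = divmod(ones_total, 10)
    tens_total, tens_partials = add_column([carry] + tens)
    total = tens_total * 10 + ones_digit
    return total, len(ones_partials), len(tens_partials)


# Hypothetical column of ten two-place numbers.
numbers = [12, 34, 56, 78, 90, 11, 23, 45, 67, 89]
total, ones_held, tens_held = add_two_place_numbers(numbers)
print(total, ones_held, tens_held)
```

For ten numbers the ones column requires eight partial sums to be held before the final sum is reached, which is the figure given in the text; the tens column requires one more under this sketch because the carried amount is treated as an additional entry.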

It is difficult to place a skill of this nature in an hierarchy,
since what is involved is not something that can be totally learned
but includes a readiness factor that is more of the general
character of mental ability.

In a sense, it may be considered to be comparable to the memory or
storage capacity of a computer; if a child doesn't have the capacity
for storing and retrieving the information concerning partial sums,
the number of such two-place numbers he can add together
successfully diminishes rapidly. Such two-place columnar addition is
sometimes restricted so seriously that the child's limit may be just
adding two two-place numbers, especially if carrying is involved.
Other children who have this capacity in excess may add an almost
limitless number of two-place numbers without difficulty.

Going beyond arithmetic, however, to reading, spelling, social
studies and science, no clear-cut hierarchy seems evident at all.
Perhaps in beginning reading the knowledge of the sounds of letters
and the ability to analyze a word phonetically (and "play back the
record," so to speak, to see how a word sounds) and then to compare
it with the oral configuration of the word which is in "storage" may
constitute the basic characteristics of "reading potential."

Instruction in reading, therefore, consists largely of the exposure
of the child to large and steadily increasing numbers of words in
different combinations. The meanings of these words must be
carefully developed together with the hearing configuration and with
the difficulties of comprehending them in a continuous passage being
emphasized.

Typically, these skills should have been taught to near mastery
level by the end of the third grade. Beyond this, the evidence for
any hierarchy in reading instruction more or less tends to
disappear. An hierarchy, as such, almost completely disappears by
grade 6 and, for the most part, very few pupils increase their
reading skill (except possibly speed of reading) beyond the level
developed by the end of grade 5 - unless they are so apt in reading
that it becomes avocational and, thus, generally environmentally
developed and not just a matter of exposure within the time allowed
for reading within the public schools.

This is neither the time nor the place to develop this concept in
detail, but in evaluating the paragraphs included in the Stanford
Achievement Test one must look at them from the point of view of
whether they are graded in some obvious sense of the word as they
move from the simple, short, uncomplicated passages at the beginning
of the test to the somewhat more complicated and abstract passages
constituting the most difficult parts of the test.

Difficulty values are given in the Technical Supplement, Appendix
[designation illegible], for the various tests included in this
study. The difficulty values for the national population were
obtained during the February-March period when the standardization
program was going on and are reported for the same form of the test
used in the study; namely, Form X: Intermediate I Battery.

In this report, another set of item statistics, based on the random
sample, is given which agrees closely with the statewide data. A
similar table is provided for Title I for comparison purposes.

If one will refer forward to these random sample difficulty values
in Section II, it will be found to be rather extraordinary how
closely the New Hampshire values follow the national pattern, very
rarely being more than 10% out of the way in terms of the percent of
children passing the items successfully.

Reliability

Reliability is intended to be a measure of the extent to which the
responses on the test are stable from one situation to another.
Thus, if one were to give Form X in a given week and follow this by
Form Y the next week, one would expect the correlation between the
two sets of scores to be quite high. The sources of the lack of
identity of score (more properly defined as rank order) from form to
form are not inconsequential. They might be considered basically as
follows:

1. The content is not identical. The sample of words used in one
word meaning test, or in reading the length of the sentences and
other characteristics, may vary from the first to the second form,
etc., even to a substantial degree. Unless this variation is
systematic, i.e., applies with equal or proportional force for all
students, the reliability coefficients (inter-form correlations) for
these tests would be affected.

Similarly, the item content of the arithmetic tests may not be
identical or may not be similarly ordered from form to form. There
may be intrinsic difficulties in the separate examples from one form
to another form for different individuals taking the test. Any
particular example may appear to be an equally good representative
of a group of items as another example, but may not in fact be so.
All of the items comprising the population of three numbers
multiplied by three numbers that could conceivably be contrived will
show substantial and stable differences, especially for a particular
child. (He may not have learned with equal assurance some addition
fact or his multiplication table entry at an earlier stage of his
schooling.)

2. In addition, there is one very important source of unreliability;
namely, what might be called "quotidian variability"; that is,
changes in the child from day to day in the effectiveness of his
performance depending on the way he feels, how strongly he is
motivated, and other factors, tend to cause him to perform
differently more or less by chance from one time to another.

This kind of stability in the test, however, can be more easily
estimated and the amount of error can be measured in statistical
terms and stated as the Standard Error of Measurement, which in this
sense would involve only those parts of the variability of
measurement attributable to the instability of the child's
performance - rather than the characteristics and content of the two
forms of the test being compared.

3. The number of items in a test and the distribution of their
difficulty values greatly affect reliability, and Stanford subtests
tend to be too short.

It should be pointed out, also, that the reliability coefficients as
reported in the Stanford manuals are basically maximum values
because they make use of 1,000-case random samples from the
standardization program, not single communities. Thus, the
variability of these populations is nearly as great as the
variability of the total group and the reliability coefficients are
maximized.



The authors quite appropriately point this out in the Technical
Supplement and suggest that the Standard Error of Measurement is
perhaps the more stable way of expressing reliability - since the
increasing variability is cancelled out when the values required,
namely the standard deviation and the correlation coefficient
between the two tests, are combined in the appropriate formula.
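
The formula itself is not printed in this section. A minimal sketch follows, assuming the conventional definition of the Standard Error of Measurement as the standard deviation multiplied by the square root of one minus the reliability coefficient. The function name and the two pairs of figures are hypothetical, chosen only to show how a wide-ranging group with a high coefficient and a narrower group with a lower coefficient can yield nearly the same Standard Error of Measurement.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Conventional SEM = SD * sqrt(1 - r).

    This is the textbook formula and is assumed here; the report
    refers to "the appropriate formula" without printing it.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical figures: a national-type sample with wide variability
# and a single community with restricted variability.
sem_wide = standard_error_of_measurement(sd=7.1, reliability=0.90)
sem_narrow = standard_error_of_measurement(sd=5.0, reliability=0.80)
print(round(sem_wide, 2), round(sem_narrow, 2))
```

Under these assumed figures the two values differ by only about a hundredth of a raw score point even though the coefficients differ by a full tenth, which is the sense in which the Standard Error of Measurement is described as the more stable statistic.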

The above is a rather simplistic approach to the question of
reliability since it omits discussion of various methods of
obtaining these coefficients, such as the split-half method as
compared to the Kuder-Richardson approach and such modifications of
the Kuder-Richardson formulas as have grown up in the past few
years - one of which is used in the Technical Supplement.

The basic fact remains that one must judge the reliability of the
test as being stated in rather absolute terms as reported, and it
probably is not too representative of what might happen in a
particular community.

What has been said above cannot be construed as a viable criticism
of the Stanford Achievement Test if one has read the Technical
Supplement and is appreciative of the fact that the reliability
coefficients as reported are maximum and that the Standard Errors of
Measurement are better statistics to reflect the test's reliability.

Some time has been spent on this discussion of reliability because
much use will be made of correlations in this study and correlations
among tests are, in turn, greatly affected by reliability of the
instruments.
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Part* I r 

Analysis of Data for the Random Sample, Fall 1969 and Spring 1970 



INTRODUCTION

The earlier Title I Report, entitled "A Description and Evaluation
of the Statewide Testing Program in New Hampshire" (1971), to which
many references have been and will be made throughout the course of
this report, was intended to investigate the extent to which Title I
children (with the advantages they had arising from their
participation in the special activities and small group instruction
characterizing Title I) did any better, relatively, than other
children - either similar to themselves in ability and age, or
equally atypical of the group or grade represented.

It must be said that in many ways this report raised more questions
than it answered, and it very early raised in the mind of this
writer the question of the suitability of a general achievement-type
measure, such as Stanford Intermediate I: Form X, for the purposes
intended; namely, to evaluate growth over a relatively short period
of time.

To arrive at this conclusion means rethinking some basic assumptions
underlying the CONSTRUCTION of a test such as that used.
Incidentally, the repeated use of this test at the beginning and the
end of the grade involved (and in this particular study grade 4
only) provided an opportunity for the further analysis of data in
new and innovative ways.

Let it be immediately said that the questions raised earlier had
nothing to do with the quality of test construction represented by
the 1964 Stanford, but simply that an instrument made for one
purpose was, perhaps, being unwisely employed for another purpose.

Stanford Intermediate I: Form X was, in its time, a measuring
instrument of unchallenged quality for the purpose of arranging
children in rank order of their achievement in the various subject
matter areas in both a reliable and valid manner and in accordance
with the best procedures for test construction available at that
time. 1/

Since that time the series has been revised, but no attempt will be
made in this report to compare the new test with the old since this
would be irrelevant and immaterial.

1/ It would be foolish to disregard the fact that questions were
raised about the 1964 Stanford norms. We are talking here not of
norms, but of matters of internal validity and reliability.



DISTRIBUTION CHARACTERISTICS

Perhaps it would be well to begin this section by referring to the
distributions of raw scores obtained on the Stanford Achievement
Test: Intermediate I: Form X in at least two basic areas, Word
Meaning and Arithmetic Computation, for the random sample tested
fall and spring. The similar distributions for the other three tests
included in this study are in the Appendix.

Word Meaning represents a test in which the schools cannot be held
totally responsible for the increasing vocabulary from grade to
grade - since obviously much of a child's vocabulary, these days
especially, comes from the general environment in which he lives.

The impingement of television (and particularly programs like
"Sesame Street" and "Electric Company") is very hard to assess in
general, but the fact that it does affect the learning of children
has been pretty well established. In addition, in the average
middle-class or upper-class home there is a very substantial amount
of reading material available to children at their own level of
development and much additional material is available which is
suitable to be read to them. The general environment clearly adds to
their mastery of an oral vocabulary, but the extent to which this is
true has never been satisfactorily measured and probably never can
be because the variables are too great in number and effect.



As regards the children who come from disadvantaged homes and from
environments which do not provide the enrichment mentioned above,
one should immediately recognize the atypicality of this situation
in the average American scene and allow for it. Nevertheless, since
admittance to the public schools is on the basis of age and not
vocabulary or intellectual development, these handicapped children
do constitute a part of the grade structure at the first, and at any
subsequent, grade in the schools of America.

The general policy throughout the country, for years, has been to
promote children more or less on the basis of chronological age
regardless of achievement, resulting in large numbers of
underachieving children at any grade. This unfortunate practice, I
think, is giving way to a more rational procedure of attempting to
provide a curriculum for each child more or less in terms of his
needs or level of development, forgetting grade level; but to say
that this has been effected by the present time is to be clearly
overly optimistic.

FIGURE II-1

Frequency Distribution, Cumulative Percent Distribution, and Stanines
Plus Histogram Showing Shape of Raw Score Distribution Graphically

The ungraded primary school was a move in this direction but was
never universally adopted. It still represents a very logical place
to begin a serious attempt at individualizing instruction, but we
have many techniques to develop and substantial change to make in
our ideas about primary grades curriculum before we have achieved
even this relatively simple beginning.

In light of this background, let us consider for a moment what we
see when we look at the distribution of raw scores on the Word
Meaning Test for Stanford Intermediate I: Form X administered in the
fall of 1969 to the entire state population at this grade level in
New Hampshire. The distribution we will examine (Figure II-1), and
others to follow, is not for the entire group but for a random
sample, carefully drawn from the whole state, which has been
independently shown to be reasonably representative of the state. 1/

In the first place, we find that the distribution is more or less
bell-shaped and more or less symmetrical, although it is
questionable if it would pass a rigid statistical test of being a
normal distribution in the strictly mathematical sense. Existing
tests for this purpose are fairly rigorous and, although the number
of cases (586 in this particular instance) is fairly large, it is
very doubtful if this distribution would be accepted as a random
variation from a normal curve if a chi-square test were applied.
This really is of relatively small importance.

The curve does show a definite piling up of scores from the middle
to the bottom and from the middle to the top, with fewer and fewer
children earning very high or very low scores, in a clearly
systematic fashion.

The mean of the Word Meaning distribution, as of testing time in the
fall (October) of 1969, was 15.9 and the standard deviation was 7.1.
The number of items in the test is only 38. The highest score earned
in the fall was 35, but there was one case receiving a score of 1.

A test of 38 four-choice items answered purely randomly, without
reference to any textual material and without any application of
thinking to the marking of the answer spaces, would yield a mean
chance score of one-fourth of the total number of items, or 9.5, and
a standard deviation roughly equivalent to the number of
alternatives, which is 4. In other words, better than 20% of the
children taking the test in the fall actually made scores below the
mean chance level.

1/ See earlier state report entitled "A Description and Evaluation
of the Statewide Testing Program in New Hampshire in 1968-69 and
1969-70" (1971).
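
The chance-level figures can be checked with a short calculation. The sketch below assumes a simple binomial model (every item marked independently, each of the four choices equally likely); that model is an assumption of this illustration and is not stated in the report, and the function name is hypothetical.

```python
import math


def chance_score_stats(n_items: int, n_options: int):
    """Mean and standard deviation of the score under blind marking.

    Assumes a binomial model: independent items, each answered
    correctly with probability 1 / n_options.
    """
    p = 1.0 / n_options
    mean = n_items * p
    sd = math.sqrt(n_items * p * (1.0 - p))
    return mean, sd


mean_chance, sd_chance = chance_score_stats(n_items=38, n_options=4)
print(mean_chance, round(sd_chance, 2))
```

The mean agrees with the 9.5 quoted in the text. Under the binomial assumption the standard deviation works out to roughly 2.7, somewhat smaller than the rule-of-thumb figure of 4, so the rule of thumb is best read as a rough upper guide rather than an exact expectation.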

The reported reliability coefficient of this test on an internal
consistency basis is better than .90.

The question must arise immediately, however, as to whether this was
the appropriate test to use at the fourth grade for the purpose
intended; namely, that of measuring the extent of gain, or growth,
in the group over a seven-month period by the Title I children as
compared to the growth in a random sample for the state, i.e., the
population presently under study.

To answer this question, one must look also at the distribution of
scores for the spring (Figure II-2). This shows many of the same
characteristics, but the slight tendency toward a positive skewness
showing up in the fall test now becomes a slightly negative
skewness, and the mean score goes from the previously quoted mean of
16, approximately, to 22 - while the standard deviation remains
about the same; namely, 7.3 as against 7.1. Thus, the raw score gain
from fall to spring over seven months in Word Meaning is
approximately six points.

Now this is hardly enough average gain to measure with any
confidence the gain of individual students. The length of the test,
namely 38 items, is obviously too short for the purpose under any
circumstances, and the suspicion remains that there is a great deal
of guessing involved. Even if this were not so, it would be almost
inconceivable that the two curves - i.e., Fall versus Spring - would
reflect the same amount of gain for all students, able and retarded.

The fact that the distributions are symmetrical and approach the
normal curve strengthens rather than diminishes the hypothesis that
guessing is a factor, and a major intent of the present study is to
try to assess the effect of such guessing on the scores, both of
groups and of individuals, and to make recommendations, finally, as
to how an improved type of instrument can be made for the very
specialized purpose of measuring gains over a short period of time
for all pupils involved.

Perhaps it would be wise at this point to consider the conditions
under which a normal curve would arise, assuming all of the marks
made by the children taking the test were random, and whether random
curves would reflect any gains over the stated time period.

A random curve would result if one were to hand out answer sheets
without test booklets and inform the children that their task was to
mark the answers as if they were taking a test. Scoring the test
could subsequently be in terms of the established key for the test
or on the basis of a random key chosen from a table of random
numbers or in some similar fashion.

The writer has done this any number of times as a graduate school
exercise, but most recently has asked a colleague, Dr. George
Prescott at the University of Maine, to repeat the experiment -
using an actual answer sheet for an actual test rather than a
standard answer sheet of the IBM type with 150 five-choice
responses.

The results were consistent with the writer's earlier studies;
namely, that the mean score was equal to approximately that which
would be expected by chance and that the standard deviation
corresponded closely to the alternatives in the test. 1/
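
An experiment of this kind is easy to imitate by machine. The sketch below is not a reconstruction of the Prescott study, whose details are not given here; it simply marks hypothetical answer sheets at random and scores them against a random key. The function name, the seed, and the choice of 586 pupils, 38 items and four options (borrowed from the Word Meaning discussion) are assumptions made for illustration.

```python
import random
import statistics


def simulate_blind_marking(n_pupils, n_items, n_options, seed=0):
    """Score randomly marked answer sheets against a random key.

    Each simulated pupil marks one option per item with no reference
    to any test booklet; the key is drawn at random as well.
    """
    rng = random.Random(seed)
    key = [rng.randrange(n_options) for _ in range(n_items)]
    scores = []
    for _ in range(n_pupils):
        marks = [rng.randrange(n_options) for _ in range(n_items)]
        scores.append(sum(m == k for m, k in zip(marks, key)))
    return scores


scores = simulate_blind_marking(n_pupils=586, n_items=38, n_options=4)
print(statistics.mean(scores), statistics.pstdev(scores))
```

Running the sketch with a different seed plays the part of repeating the exercise at a later date: the mean and standard deviation shift only by chance amounts, since nothing has been learned in between.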

There is NO reason to expect that other than a chance difference
would come about if the experiment had been repeated seven months
later, however. In other words, the reported gain from fall to
spring probably reflects change (growth) from fall to spring as a
result of exposure and learning; but is it enough to be actionable
or convincing? Is it free enough of guessing to make the results
convincing?

Considerations of this sort led the author to think very seriously
as to what kind of a test should be used to measure gain or growth
over a relatively short period of time, and this constitutes the
major purpose of this whole study - especially as it is affected by
the factor of guessing and its influence on the nature of the score
distribution.

Before continuing, let us next consider the situation with respect
to Arithmetic Computation.

The Arithmetic Computation Test of Stanford Intermediate I: Form X
contained 39 items as compared to the 38 in Word Meaning. However,
the scores ranged from 1 to 29 in the fall, indicating that the test
had plenty of top (i.e., was harder) and the initial distribution of
raw scores (Figure II-3), if anything, was somewhat more
symmetrical. The mean was 11.5 - standard deviation, 4.5.

1/ A summary of this Prescott study is available on request.

In the spring (Figure II-4), however, the range of scores was from 3
to 38, which is not surprising. The mean had jumped from 11.5 to
18.3 (about seven points of raw score), while the standard deviation
had increased from 4.5 to 7.0, a very significant fact.

These results illustrate the reason why Word Meaning was contrasted
in this study with Arithmetic Computation. The results very clearly
show the greater effect of in-school learning in the area of
Arithmetic Computation as compared to Word Meaning. In Arithmetic
Computation, very little incidental learning takes place at home,
because of its specialized nature. Programs like "Sesame Street" do
not have the impact that they have in vocabulary. Family experience
or community living is not that much involved in this area.

In other words, a test intended to measure the outcomes of specific
in-school instruction is much more likely to be suitable for the
purpose if the content is limited more strictly to the content of
the curriculum, as clearly defined in textbook courses of study and
particularly the local curriculum, and not much affected by
incidental factors.

The distribution of scores in Arithmetic Computation for the spring
testing program has the same general symmetrical character as the
one for fall, and in both distributions there is an absence of a
suggestion of change in skewness from positive to negative - as is
evident in the distributions for Word Meaning.

The number of choices in Arithmetic Computation is four numerical
options and one option called "NG," or Not Given, which is used
sparingly but is definitely used as a keyed response for which
credit is given by the authors of the test. It is intended as a kind
of escape valve for the pupil who gets a wrong answer by his own
computation.

There still is the lingering question, however, as to the extent to
which a guessing factor affects the scores, in this instance, in a
similar way to that involved in reading.

The standard correction for guessing, which is the number of rights
less a fraction of the wrongs (the number of wrongs divided by one
less than the number of options offered), has been shown repeatedly
to be ineffective and to be totally inoperative if a child answers,
or attempts to answer, all of the questions contained in a test.
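
A brief sketch may help show why the correction is inoperative when nothing is omitted. It uses the formula as described above, rights minus wrongs divided by one less than the number of options; the function name and the raw scores are hypothetical, and the 38-item, four-choice test is used only as a convenient example.

```python
def corrected_score(rights, wrongs, n_options):
    """Standard correction for guessing: R - W / (k - 1)."""
    return rights - wrongs / (n_options - 1)


n_items, n_options = 38, 4

# Hypothetical pupils who attempt every item, so wrongs = items - rights.
raw_scores = [8, 12, 16, 22, 30]
corrected = [corrected_score(r, n_items - r, n_options) for r in raw_scores]

# With no omits the corrected score is a straight-line function of the
# raw score, so the pupils keep exactly the same rank order.
rank_preserved = all(a < b for a, b in zip(corrected, corrected[1:]))
print(corrected, rank_preserved)
```

When every item is attempted, the correction merely rescales the raw scores (a chance-level score of 9.5 becomes zero) and changes no pupil's standing relative to another. It can alter rank order only when pupils differ in the number of items they omit.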
THE DIFFICULTY CHARACTERISTICS OF EACH OF
THE ITEMS IN THE FIVE TESTS BEING CONSIDERED

In Table II-1, item analysis data are presented for the five
Stanford Tests considered in this study for the random sample of 567
students selected for testing in the spring for whom fall test
results were also available. The table referred to above presents
the data for both Fall and Spring and also presents data separately
for Rights, Wrongs, and Omits. Finally, it presents a ratio of the
Rights divided by the Attempts (R/A), the significance of which will
be discussed in subsequent paragraphs.
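
The two item statistics can be illustrated with a short sketch. It assumes that Attempts means Rights plus Wrongs, with Omits excluded, which is the natural reading but is not spelled out in this section. The function name and the item counts are hypothetical; only the sample size of 567 comes from the text.

```python
def rights_over_attempts(rights, wrongs):
    """R/A ratio, taking attempts as rights + wrongs (omits excluded)."""
    attempts = rights + wrongs
    if attempts == 0:
        return None  # nobody attempted the item
    return rights / attempts


# Hypothetical counts for one item in a sample of 567 pupils.
rights, wrongs, omits = 340, 170, 57
percent_passing = rights / (rights + wrongs + omits)
r_over_a = rights_over_attempts(rights, wrongs)
print(round(percent_passing, 3), round(r_over_a, 3))
```

Whenever an item is omitted by some pupils, R/A is higher than the proportion passing, because pupils who left the item blank are removed from the denominator. The two figures coincide only when there are no omits.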

Let us consider, first, the percent passing the various items from 1
to N in each test from the point of view of the order of difficulty.
Starting with Word Meaning, we see that the items, even in the Fall
administration, are generally on the easy side for this sample. No
item in the first ten is passed by fewer than 60% of the pupils, and
percent passing for most of these beginning items is much higher.

Generally speaking, the authors and publishers of the test put the
Word Meaning items in order of difficulty based upon the data from
the tryout edition of the test, from which the final forms were
made, and it is interesting to see that even after the passage of
some years a relatively small group representing a random sample of
the fourth grade in New Hampshire shows essentially that this order
of difficulty has remained more or less constant - with a
surprisingly small number of exceptions.

Perhaps the first ten items of this test, if you consider both Fall
and Spring performance, could be considered to be sufficiently
mastered at the end of grade 4, so that these words could be
considered essentially to be in the working vocabulary of the
children - assuming that the percent answering the questions
correctly is not too greatly affected by guessing. The criterion
used to determine mastery is roughly 75% passing.

In neither Fall nor Spring does a large enough percentage of the
group answer the questions correctly from #11 on to permit the
assumption that the words in question are in the working vocabulary
of the children, and the last half of the test (roughly) contains
items of such difficulty that it would be quite unreasonable to
suppose that the words were, indeed, part of the working vocabulary
of the students involved.



Turning our attention now to Paragraph Meaning and scanning the item
difficulty values quickly, especially those for spring, we see that
a fair number of items, down to item #13, show a percent passing of
75 or higher; but beyond item #13 there are very few such items and
after item #23 the items drop off very rapidly in difficulty or in
percent passing.

Paying attention now just to the percentages for spring - that is,
at the end of the instructional period - as we move on to
Arithmetic Computation, we see the first few items show a fair
level of mastery, up through perhaps item #7, and then the items
drop off quite rapidly until, after item #14, there are very few
items that exceed 50% compared to the total number of items in the
test.

For all practical purposes, the last ten items or so in the
Arithmetic Computation Test show negligible mastery, on the part
probably of the ablest pupils only, so we at this point face up
very clearly to the fact that this test is just not suited to the
curriculum of New Hampshire, or perhaps it would be better to say
it is not suited to the pace with which arithmetic is introduced or
the amount of attention paid to it. Certainly if Stanford
Computation is to be a guide, the arithmetic situation was serious
at the time this test was given.

A word of caution is needed here. This is a test made to measure
all levels of ability - not an assessment of a fairly "local"
curriculum. A "good" measuring instrument has a mean score at its
optimum level of approximately one-half the number of items in the
test and item difficulty values ranging from very low to very high;
e.g., .10-.90, possibly. This is why such a test serves so poorly
to measure individual pupil gains in a situation like this and
hardly serves, even under optimum conditions, as a good measure of
group gains.
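The rule of thumb above can be checked with a little arithmetic. When each item counts one point, the mean raw score of a group equals the sum of the item proportions passing, so difficulties spread evenly from .10 to .90 give a mean near half the test length. The sketch below uses nine hypothetical item difficulties (not values from this study), and the function name is an assumption.

```python
def expected_mean_score(item_difficulties):
    """Mean raw score for a group, given each item's proportion passing.

    With one point per item, the group mean is the sum of the proportions.
    """
    return sum(item_difficulties)


# Hypothetical difficulties spread evenly from .10 to .90 (not study data).
difficulties = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
mean_score = expected_mean_score(difficulties)
half_length = len(difficulties) / 2
print(round(mean_score, 2), half_length)
```

A test whose items are nearly all very hard for the group, as in the later arithmetic items discussed here, falls well below this half-length mean.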

In Arithmetic Concepts there are very few items overall, from the
very beginning of the test, where 75% of the children answered the
question correctly in the spring. They can be counted on the
fingers of one hand, as a matter of fact.

Looking at this test from the point of view of the
criterion-reference basic principle of mastery of items in
hierarchical form - that is, where a skill at a given level is the
basis for a more highly developed skill at another higher level -
we see that Arithmetic Concepts completely fails to meet this test.

The performance at the end of the year is typically somewhere in
the 50% passing range up to item #26, with generously interspersed
higher values for a few items before this, but after item #26
almost nothing is shown that indicates even understanding, let
alone mastery. The figures reported could be actually the result of
chance.

Table II-1, Page 4 - Item Difficulties, Random Sample

Paragraph Meaning

ITEM
No.              R     W     O     R/A

46    Fall      .17   .36   .47   [illegible]
      Spring    .35   .46   .19   .43

47    Fall      .10   .40   .50   .20
      Spring    .23   .55   .22   .29

48    Fall      .27   .23   .50   .54
      Spring    .47   .29   .24   .61

49    Fall      .13   .30   .57   .29
      Spring    .23   .49   .28   .33

50    Fall      .05   .37   .58   .11
      Spring    .10   .60   .30   .14

51    Fall      .18   .21   .61   [illegible]
      Spring    .35   .34   .31   .50

52    Fall      .12   .25   .63   .33
      Spring    .33   .35   .32   .48

53    Fall      .19   .17   .64   .52
      Spring    .42   .25   .33   .62

54    Fall      .11   .24   .65   .30
      Spring    .26   .39   .35   .40

55    Fall      .15   .20   .65   .44
      Spring    .31   .33   .36   .48

56    Fall      .16   .18   .66   .48
      Spring    .35   .26   .39   .57

57    Fall      .06   .26   .68   .19
      Spring    .15   .44   .41   .25

58    Fall      .09   .22   .69   .30
      Spring    .23   .36   .41   .38

59    Fall      .07   .22   .71   .24
      Spring    .11   .44   .45   .21

60    Fall      .09   .20   .71   .31
      Spring    .17   .38   .45   .32



In Arithmetic Applications there are three items in the beginning
of the test that show a high level of mastery, but the subsequent
difficulty values then begin to fall off precipitously almost
immediately. Item #7 reaches 75%, but it stands out as being very
much the exception.

Continuing on through the test, the general trend is for items to
be answered in the 50% to 60% range down to about item #22, after
which there is another precipitous fall with as few as 17%
answering item #28 correctly. Here, certainly, many of the items
are measuring things that have not been presented to the group
formally or taught in any real sense of the word. It is the
writer's best guess that the performance here, while it looks
fairly good, is largely the result of the ability of the ablest
students to handle the arithmetic situation "on their own."

In all of this discussion, especially of the Arithmetic Tests, a
person reading this study should have before him the test booklet
itself so that he can see exactly the kinds of items that children
were able or unable to answer in the spring of 1970 - and ask if
this is a reasonable situation. In other words, was the Stanford
Test so far out of line with the New Hampshire curriculum that it
never should have been used at this grade level?



Table II-2

Correlations in Raw Scores
Between Otis-Lennon and Selected Stanford Tests
RANDOM SAMPLE - Grade 4 - Fall 1969

Raw Score Correlations of Otis-Lennon with Stanford:

Selected Stanford             NH Data
Form X Tests                  Grade 4

Word Meaning                    .72
Paragraph Meaning               .73
Arithmetic Computation          .42
Arithmetic Concepts             .65
Arithmetic Applications         .60

Data from Otis Manual (values listed in the order printed):

Grade 3:   .62   .60   .50   .67
Grade 5:   .77   [illegible]   .73   .75

Item Performance vs Normative Interpretation

Remember now, we're talking about individual items. Factors such as
overall, i.e., average, difficulty, norms, rank order, and so
forth, are of no significance. The number of cases, amounting to
567 students, has been shown to be generally comparable to the
whole state. It is large enough so that the errors of measurement
in these percentages are small.

We must therefore, in retrospect, determine whether we can at all
be satisfied with the arithmetic performance of New Hampshire
students if these data truly represent what they are able to do,
especially considering the fact that these are mostly five-choice
multiple choice questions and even the percentages as reported are
inflated by responses that are correct sheerly through random
marking.
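The size of this inflation can be roughed out with the conventional correction for guessing, R - W/(k - 1), where k is the number of choices. The report does not say that this formula was applied to the data; it is brought in here only as a familiar illustration, and the function name is hypothetical.

```python
def corrected_proportion(rights: float, wrongs: float, n_choices: int) -> float:
    """Conventional correction for guessing: R - W / (k - 1).

    Assumes every wrong answer comes from blind guessing among k choices,
    which is an idealization and not a finding of this study.
    """
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    return rights - wrongs / (n_choices - 1)


# Spring proportions for one five-choice item in Table II-1: R = .35, W = .46.
estimate = corrected_proportion(0.35, 0.46, 5)
chance_level = 1 / 5  # share of Rights expected from marking at random
print(round(estimate, 3), chance_level)
```

Under these assumptions the 35% passing shrinks to roughly 23%, which is close to the 20% a group would earn on a five-choice item by marking at random.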

We have not said anything, as yet, about the number of omitted
items. Actually, in an ideal situation a child should mark only the
items he knows and omit the rest. Let us say that it is considered
permissible to make an intelligent guess in a four- or five-choice
item (Word Meaning being four). This would account for few
additional "Rights" due to "guesstimation"; that is to say, partial
knowledge is used positively. Those children who have to guess on
the meaning of the word certainly would not be qualified as being
masters of the word with regard to its use in general conversation
or in writing.

Yet it must be emphasized repeatedly that the content of this test
was taken from sources which indicated they were generally
recognized to be suitable for use in the fourth grade. Naturally,
the words in the total test have to cover a wide range of
difficulty because the teacher has to cope with a wide range of
ability, whether this is desirable or not, and this test was
intended as a measuring instrument.

Such a statement can be strengthened by relating the Word Meaning
data from the Stanford Achievement Test to information from the
so-called intelligence test or mental ability test. In this
particular instance, the Otis-Lennon Mental Ability Test was used
and the results of its use are reported in the aforementioned
Title I Report. 1/ To amplify this we are including here Table II-2
giving the correlations of Otis-Lennon with the five Stanford Tests
we are investigating for our own group plus comparable sets of
similar data for other groups. 2/

1/ Page 19, Table III-B-2

Many people argue that the Otis-Lennon Test is, after all,
essentially another vocabulary test - not too different from the
vocabulary (Word Meaning) test in the Stanford Achievement Test.
This comment is pertinent to our problem. However, the Otis-Lennon
Test measures far more than just vocabulary - including (as it
does) arithmetical problems, spatial reasoning problems, analogies,
and a whole variety of mental skills and knowledges that are not
specifically curriculum oriented.

It makes little difference whether the skill demonstrated on the
Otis-Lennon or other similar mental ability tests arises from
native intelligence, i.e. inherited mental ability, or from a good
or poor environment - whatever that might be. Whatever it is,
quality of environment is not to be measured in terms of dollars
and cents of salary earned by the parents of the child or children
in question. This has been repeatedly shown to be a fallacy in
individual cases, even though there is a positive correlation as
shown by group-type analysis. (See data from the Metropolitan
Manual for Interpreting, Revised 1972, concerning the relationship
of mental ability to socio-economic status - e.g., salary of
parent, education of parent - in the standardization groups for
this battery.)

It may appear to strengthen the argument of the environmentalists
to note that it can be easily shown that not every word in the
Stanford Word Meaning Test occurs in the curriculum for every
school (or most schools) in the United States at grade 4 or even
the adjacent grades of 3 and 5. On the other hand, analysis of the
words that are included in the Stanford Achievement Test: Form X
for the Intermediate I Battery shows that they represent a good
cross section of words occurring in the kinds of children's
literature to which the average child in an average family is
exposed at this level of development.

The "curriculum validity" problem really arises from an unrealistic
desire on the part of school people and, more particularly, parents
and the public in general to have children master everything
presented to them within the walls of the school at the grade
levels specified. This is totally unreasonable in the case of Word
Meaning, especially in view of the conditions as they presently
exist, and there is ample statistical and common sense evidence to
establish this point. Arithmetic may be an entirely different
matter, since environmental learning is much less effective here.

2/ Grades 3 and 5 correlations are from the 1969 Otis-Lennon Mental
Ability Test Technical Handbook.

What then can we say about the Stanford Achievement Test: Word
Meaning: Intermediate I Battery: Form X as an instrument suitable
for the purpose for which it was used; namely, to measure
achievement in vocabulary at the beginning and end of grade 4?

As a measuring instrument, it has served the purpose well. In other
words, it has selected those individuals who have a high vocabulary
and has similarly identified those who have a paucity of skill in
that area. This is very valuable information for the teacher and is
quite irrelevant to the specific words which may be taught at the
local level.

As a matter of fact, there are few situations where vocabulary, as
such, is taught independently of the total language program, which
includes reading, speaking, spelling, and the use of the English
language in writing.

On the negative side, the Stanford Word Meaning Test is quite
obviously too short, and therefore too limiting in proportion of
words which will be found in a local curriculum, to measure
specific outcomes of even the most carefully planned "new" programs
of instruction. Children will not have been exposed in a specific
learning situation to a great number of these words, but will have
learned them quite incidentally both in their schoolwork and in the
home and community in general. A radical solution to the problem
may be necessary, and in due time in this report we will attempt to
attack that problem.

In the meantime, it is essential that we turn our attention to the
comparisons between the percent of items answered incorrectly and
the percent of items omitted. What we find here is that the percent
of items answered incorrectly is not too different from the percent
answered correctly, except for the very easy or very difficult
items, and the percent of answers omitted is substantially small.
In other words, children are marking answers in far greater
proportion than they would if "you," i.e. the teacher or the
school, expected them to mark only those words where they felt they
had a reasonable chance of really knowing the word. In relatively
few cases are they actually omitting items in large number;
therefore, the case for random guessing is greatly strengthened and
the validity of the test for measuring anything is weakened.

Let us follow up a little more closely the suggestion just made.
The writer may report in this connection a fairly large number of
instances where he has queried children individually concerning
their test-taking behavior. Almost uniformly, the response was that
they view a multiple choice question (or any of its variants) as
simply a situation where they answer the questions immediately,
i.e. perceptively, if they know what the answer is.

If they do not know, they canvass the possible right answers as
given and choose the one that seems to be the most likely and mark
it. If they can find no clues as to what the correct answer is
among the words provided as alternatives, they simply mark an
answer by chance in the hope of getting an unearned credit, at
least until they recognize that they are simply beyond their depth.
Even then, a remarkable number just continue to mark all answers in
the test.

The question for further study is, "Is this what children actually
do?" The data to be reported later will reveal the extent to which
this appears to be the case.
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ANALYSIS OF PUPIL RESPONSES BY CATEGORY 

One unique bit of information that is available is the result of
the fact that we do have fall-spring item analysis data showing the
response of each pupil to the identical items on two occasions. The
responses are separated by a period of approximately seven months.
Thus we are able to determine the consistency (or lack of
consistency) in the pupil responses over a period of learning
covering the better part of the school year.

One of the first methods of attack was to create categories of
response which would describe how a pupil had answered an item in
the fall versus the spring when these two periods were considered
jointly.

An example of this type of categorical analysis is the "RR" (Right
in the fall, Right in the spring) category. An item falling in this
category would be totally useless for measurement of learning
resulting from a particular program of instruction since it would
simply demonstrate that the learning that had taken place prior to
the testing time in the fall was maintained through the period of
seven months.

The individuals who were involved responded to the item correctly
even after this passage of time, barring the quite remote chance of
fortunate guessing fall and spring. Result: teaching effort is
wasted.

The existence of such items in effect reduces the length of the
test as a measuring instrument, the representativeness of its
coverage, and its reliability and validity - whether this test be
Word Meaning, or Paragraph Meaning, or Arithmetic.

A logical analysis of the possible categories reveals that the ten
decided upon would almost exclusively cover every possible response
a pupil might make to an item within the established response
framework; i.e., multiple choice with answer sheet.

All pupil-item responses (number of pupils times number of items)
are broken down by category and presented in two tables. The two
tables overlap in that the numbers of pupil-item responses involved
in each category are repeated, but in one table are interpreted in
terms of a mean per category, and in the other table, in terms of a
percentage per category.

Interpretation in Terms of Mean Per Category

Let us consider, first, Table II-3, in which a value therein
labeled "Mean Responses" is presented below the number of
pupil-item responses in each category. These mean values were
found, for example, by dividing the number of pupil-item responses
under the category "RR" by the total number of pupils, which in the
random sample was 567 cases including both boys and girls.

(Actually boys and girls were studied separately, but no
significant sex differences were found and, therefore, for this
report the data are combined.)

When this process is carried out, the quotient is the average
number of test items falling in that category for the group tested.

The results for all of the categories are interesting in that each
reveals one thing or another. For example, the "WWS" (Wrong in the
fall, Wrong in the spring, Same choice) category would suggest that
a pupil or a number of pupils might have had some positive
misinformation which was preserved over the period of time during
which they were under instruction; while the "WWD" (Wrong, Wrong,
Different) category almost surely identifies those who did not
answer the question on a basis of specific knowledge at all, but
merely marked a response by chance.

Similarly, the "OO" (Omit, Omit) category represents the children
who refused to commit themselves, either fall or spring, in a
situation where they felt no competency. They are temperamentally
"no guessers."
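The sorting of fall-spring response pairs into categories can be sketched in code. The labels "RR," "WWS," "WWD," "OO," "WR," "OR," "RW" and "RO" are those named in this report; "WO" and "OW" are assumed here to complete the set of ten, and the function name and sample data are hypothetical.

```python
from collections import Counter
from typing import Optional


def categorize(fall: Optional[str], spring: Optional[str], key: str) -> str:
    """Label one pupil-item pair; None stands for an omitted item."""

    def status(response: Optional[str]) -> str:
        if response is None:
            return "O"
        return "R" if response == key else "W"

    f, s = status(fall), status(spring)
    if f == "W" and s == "W":
        # Same wrong choice both times versus two different wrong choices.
        return "WWS" if fall == spring else "WWD"
    return f + s


# Hypothetical (fall, spring, key) triples for one pupil (not study data).
sample = [
    ("B", "B", "B"),    # RR
    ("A", "B", "B"),    # WR
    (None, "B", "B"),   # OR
    ("A", "C", "B"),    # WWD
    ("A", "A", "B"),    # WWS
    (None, None, "B"),  # OO
]
tally = Counter(categorize(f, s, k) for f, s, k in sample)
possible_gain = tally["WR"] + tally["OR"]
print(dict(tally), possible_gain)
```

Summing such tallies over all pupils gives the pupil-item counts per category on which the tables in this section rest.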

At this moment, however, we're concerned with two response
categories which can be readily combined; namely, the "WR" (Wrong,
Right) and the "OR" (Omit, Right) responses. Only in the case of
these two categories can we concede that learning most likely has
taken place as evidenced by the test results, since only in these
categories do we find that an initial response, which indicates
that "learning" or "mastery" has NOT previously taken place, has
changed to a response which indicates that now the pupil may,
indeed, have learned the answer to the questions involved; i.e., to
answer a question which he was previously unable to answer.

Continuing now with Word Meaning, for the sake of further
illustration, when the "WR" and "OR" categories were added together
for the random sample, the total number of pupil-item responses was
5,053. When 5,053 is divided by 567, the result is the average
number of items answered in a manner to suggest an increment in
mastery of the material in question - in this case vocabulary -
during the seven-month period. This gives a mean number of items on
which learning has probably taken place of 8.9.

Note particularly that this does not identify the particular words
which have been learned, and that these words may not indeed be the
same from pupil to pupil; it simply emphasizes the fact that out of
38 items, a total population of 567 came up with an average of 9
items which appear to have been learned during the seven months.
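The division just described is easily reproduced. The sketch below uses only figures given in the text (5,053 "WR" plus "OR" responses, 567 pupils, 38 Word Meaning items); the function name is hypothetical.

```python
N_PUPILS = 567        # random sample size given in the text
N_ITEMS_WORD = 38     # number of Word Meaning items


def mean_per_pupil(pupil_item_responses: int, n_pupils: int = N_PUPILS) -> float:
    """Average number of items per pupil falling in a response category."""
    return pupil_item_responses / n_pupils


mean_learned = mean_per_pupil(5053)          # "WR" + "OR" for Word Meaning
share_of_test = mean_learned / N_ITEMS_WORD  # fraction of the 38 items
print(round(mean_learned, 1), round(share_of_test, 2))
```

The same quotient underlies every "Mean" entry in the category tables, since each is a category count divided by the 567 pupils.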

A moment's thought makes it clear that this line of reasoning
cannot be followed in a single testing. Any fall Wrong or Omit can
be transformed to a Right response in the spring because of real
learning. Only the opportunity provided by the fall-spring analysis
reveals the small average number of items learned. Similarly, some
of the "RR" responses do not really reveal positive learning -
because both "Rs" may have come about by guessing, a real but
remote possibility.

What is lacking, therefore, is prior assurance of a serious effort
to test what the teacher teaches during the seven months in
question - without encouraging "teaching for the test." This
"community" curriculum is only approximately "knowable" beforehand
for any standardized test, and there is no infallible way of
freeing the teaching situation of the totally undesirable effect of
the "coaching" dilemma.



Consider now, by way of reenforcement of the above, the fact that
all categories except "WR" and "OR" are in a sense "disabled" - in
that they cannot reveal that any learning has taken place.

If a child answers a question "RR," this simply means that he knew
something at the beginning and continued to know the answer at the
end of the period of instruction. A "WWD" response is highly
suggestive of guessing; etc. If only 9, or less than one-quarter,
of the questions show average positive change over seven months,
the test obviously cannot possibly be analytical for an individual
child.

Unfortunately, all of the circumstances involved in the collection
of these data suggest that the instrument was not an appropriate
one to prove the effectiveness of instruction in the field of
vocabulary development with this population. Any survey instrument,
excellent though it is for the purpose intended, cannot be of
sufficient effective length to establish curriculum validity for
the individual school administrative units involved.

An ideal test would be one with a large number of responses "wrong"
in the fall, all of which were previously certified locally as
valid teaching objectives during the coming year. Items not taught,
but learned anyway, give false credit to the school; items taught,
but not learned, raise questions about the effectiveness of
instruction.

Subsets of locally valid items may be selected from standardized
tests by an appropriate local (logical) analysis of the test items
based upon the established goals for the year - a long-recommended
practice. However, a desirable practice becomes a required practice
if the intent of testing is specifically the evaluation of local
teaching efforts.

This conclusion is obvious enough but is differently stated when
one says, as above, that only the Wrong, Right or Omit, Right items
can provide evidence of growth. The way to demonstrate more growth
is to make a special variety of test by which only the items taught
are considered in determining changes attributable to the child's
instruction. Obviously additional determining factors are the level
of motivation in taking the test coupled with freedom from
guessing. Not guessing by choice because one wishes to be honest -
i.e., to reveal his areas of ignorance as well as knowledge - is an
outcome of a good teacher-pupil relationship.



We turn now to Arithmetic Computation, in which most learning
actually takes place in the school and not in the general
environment. The average is 9.9, or about 10 items or 10 learnings
resulting from the seven-month period of instruction. (The two
tests are specifically chosen to provide a contrast because one is
so obviously influenced by the general environment and the other
one is not so obviously influenced by this environment.)

Note that in both of the instances quoted above we are talking
about averages. These are arithmetic means and, therefore, no
statement can be made concerning the percent of children learning
more or less than the mean - unless we can further assume that the
distributions are symmetrical, in which case the mean and the
median would be the same.
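The caution about means can be made concrete with a small example. The seven values below are hypothetical (not study data) and are chosen so that one very able pupil pulls the mean well above the median.

```python
from statistics import mean, median

# Hypothetical numbers of items learned by seven pupils (not study data).
items_learned = [4, 5, 6, 7, 8, 9, 24]

avg = mean(items_learned)
mid = median(items_learned)
share_above_mean = sum(1 for x in items_learned if x > avg) / len(items_learned)
print(avg, mid, round(share_above_mean, 2))
```

Here the mean is 9 items, yet only one pupil in seven learned more than 9, so a mean of 8.9 or 9.9 by itself says nothing about how many children reached that level.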

The measurement of short-term gains is difficult indeed and is
doomed to be inconclusive or ambiguous unless one can establish
that the knowledge involved was not known at the beginning of
instruction and was mastered by an established percent of
individuals at the end of the period of instruction. Considering
variations in the Title I projects submitted and looking also at
the wide range of achievement and ability of a group of students in
any typical class, the situation is even more complicated!

It is also perfectly evident that we must have some assurance that
the pupil group involved in the experiment is able, that is, ready,
to learn what the LOCALLY VALID test measures.

We also must be assured that instructional time allowed will be
sufficient. We can assume about 180 days of in-school time per
year, or about 140 days in seven months between first (fall) and
second (spring) testing time, but the minutes allowed per day are
variable, both from subject to subject and unit to unit.

We can guess that the total amount of time involved in actual
vocabulary development, including or involving the particular words
in the SAT Word Meaning Test, probably would be small; but there
are other factors involved, such as incidental outside word or
vocabulary learning, which make this a bad subject for evaluative
purposes.

If the in-school instruction had as its main purpose the
development of widely applicable methods of word attack, the
particular subpopulation of words in the test would not be as
important. A pupil could apply these skills to answering any Word
Meaning items - a desirable goal but one we cannot assume was
characteristic of our population.

Let's turn our attention now to what is true of the Arithmetic
Computation Test, where we can tie down much more definitely what
learning tasks are facing the pupils of grade 4 during the
seven-month period under investigation if they are to cope
adequately with the Stanford Arithmetic Tests.

If we assume the same 140 days of time and an allotment of one-half
hour per day to instruction in arithmetic, with a major emphasis at
this grade level on computation, we come up with a total of about
70 hours of instruction over the seven months. Is this enough?

Perhaps our estimate per day is too low. What if we assume 60
minutes? Would that be enough? It would be a viable project to see
what would happen, comparatively, if 50% to 100% more time were
allowed, or if a small amount of time per day were devoted to
maintenance of skills in oral arithmetic.
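The time estimates above are simple products of days and minutes per day. The sketch below reproduces the 70-hour figure and the alternatives mentioned, assuming the 140 school days quoted in the text; the function name is hypothetical.

```python
def instruction_hours(days: int, minutes_per_day: float) -> float:
    """Total hours of instruction over a given number of school days."""
    return days * minutes_per_day / 60


DAYS = 140  # school days between fall and spring testing, as assumed in the text

base = instruction_hours(DAYS, 30)            # one-half hour per day
half_again = instruction_hours(DAYS, 30 * 1.5)  # 50% more time
doubled = instruction_hours(DAYS, 60)         # 100% more time
print(base, half_again, doubled)
```

Even the doubled allotment is only 140 hours, which has to be set against the ten or so items per pupil that show evidence of learning.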

If we further assume that this instruction was carried on in the
average self-contained classroom with its typical wide spread of
talent, it is probably unlikely that more than half of the members
of such heterogeneous classes ever could really master any except
the simplest of the knowledges to which they are theoretically
exposed but which they did not partly know when the test was first
administered. What then?

In point of fact it is horrendous, from a scientific point of view,
to draw conclusions in any subject field without knowing and
stating these facts. God forgive us for what we do in the name of
educational evaluation!

In defense of the instrument involved (and of testing in general),
it must be remembered that the content of the test was taken
directly from the typical content in arithmetic computation texts
for grades 4 and 5. The assignment of text content to grade is not
a matter of 100% agreement, even in arithmetic!



In other words, there is no hard and fast hierarchy that says that
"A" must be learned before "B" and "B" learned before "C" even in
arithmetic - or even, more particularly, in arithmetic computation.
Hence an item which might be a fourth grade item in one system or
one curriculum might be assigned to the third or fifth grade in
another curriculum, etc.



This simply means that the content of the test must be defined in
terms of the curriculum arrived at by the agency which is
responsible for making such curriculum decisions - whether this is
the local community, the county, or the state.

In New Hampshire (where this experimentation was carried on),
theoretically at least, the decisions usually are made at the
school district level or lower, without any really notable
interference at the state level - although the State Department of
Education exercises some influence in determining desirable
objectives, especially in fields as specific as arithmetic
computation. There is no mandated textbook in any subject and no
set course of study to which all must adhere.

Interpretation of the Ten Categories in Percents

In Table II-4 the pupil-item data are presented, but the method of
interpretation used is different. It is intended to reflect the
proportion of all possible pupil-item responses, or interactions,
that suggest that learning has taken place as compared to the total
number of such pupil-item responses included in the test, category
by category.

The same argument given above holds here. The only categories
unequivocally revealing positive changes in the direction of
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Answer Sheet Study II 

learning are the "WR" and "OR" categories. When the number of
cases, i.e. pupil-item responses, in these two categories is
combined and this number is divided by the total possible number of
such pupil-item responses (which varies, of course, from test to
test) the results show a remarkable consistency.

The percent of such pupil-item responses which appear to fall in
the probable learning category is 22% to 25%. In other words, 25%
or less of the possible pupil-item responses indicate that learning
did, in fact, take place.
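The 22% to 25% range can be reproduced from the counts reported in Table II-5. In the sketch below the item counts and "WR+OR" totals are taken from that table and the 567 pupils from the text; the names used are hypothetical.

```python
N_PUPILS = 567  # random sample size

# Test name -> (number of items, "WR+OR" pupil-item responses), from Table II-5.
TESTS = {
    "Word Meaning": (38, 5053),
    "Paragraph Meaning": (60, 7998),
    "Arithmetic Computation": (39, 5630),
    "Arithmetic Concepts": (32, 4059),
    "Arithmetic Applications": (33, 4158),
}


def gain_percent(n_items: int, wr_or: int, n_pupils: int = N_PUPILS) -> float:
    """Percent of all possible pupil-item responses falling in WR or OR."""
    possible = n_items * n_pupils
    return 100 * wr_or / possible


percents = {name: round(gain_percent(*counts)) for name, counts in TESTS.items()}
for name, pct in percents.items():
    print(f"{name}: {pct}%")
```

The possible pupil-item responses are simply items times pupils (for Word Meaning, 38 x 567 = 21,546), which is why the denominator varies from test to test.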

In view of the four- or five-choice multiple choice nature of the
present material, we need to be acutely aware of the "RW" and "RO"
responses - which suggest the fall Right responses were the result
of guessing in the fall.

If a teacher is operating on the basis of fall data, she may be
misled by the fall response of those falling in the "RW" and "RO"
categories. Some fall responses are probably guesses if the
"RW"-"RO" data can be credited. In other words, the "money in the
bank" suggested by the fall performance was not there! Obviously,
item analysis data are also invidiously affected.

(The "RW + RO" and "WR + OR" data are summarized in Table II-5.)



Thus we must conclude that the analytical response approach has the
virtue of alerting us to an often sensed but rarely documented
fact: item analysis data can be misleading if based on a single
measure.

Perhaps a comment is in order concerning the guessing (or
forgetting) that does take place among those who mark an item "RW"
or "RO." Such "Right-in-the-fall versus Wrong-in-the-spring"
responses are particularly vexing because the fall item analysis of
Rights is so misleading. Nineteen percent (19%) of the total number
marking items Right in the fall marked the item Wrong or Omitted it
in the spring.
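The 19% figure is a ratio of the "RW" and "RO" counts to all fall Rights, that is, to "RR" plus "RW" plus "RO." The "RR" totals are not reproduced in this part of the report, so the sketch below uses hypothetical counts chosen only to show the form of the calculation; the function name is also an assumption.

```python
def fall_rights_not_repeated(rr: int, rw: int, ro: int) -> float:
    """Share of fall Rights that became Wrong or Omit in the spring."""
    fall_rights = rr + rw + ro
    if fall_rights == 0:
        raise ValueError("no fall Rights to analyze")
    return (rw + ro) / fall_rights


# Hypothetical pupil-item counts (not study data), picked to give 19%.
share = fall_rights_not_repeated(rr=810, rw=120, ro=70)
print(round(100 * share))
```

Computed pupil by pupil rather than for the whole group, the same ratio could serve as one rough sign of a guessing tendency in an individual.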

There is no simple solution to this dilemma, but several actionable
approaches relating to the scanning of the data for other evidence
of a guessing tendency on the part of individual pupils may yet
become clear as we proceed.

The inconclusive nature of the data that we are able to present
here, while very helpful because it does reveal several lacks
in the test and/or this experimental setup, simply tells us
that there are too many uncontrolled factors to draw firm
generalizations from such survey test results over short
periods of time and without specific item selection to create a
subset of items of unquestionable curriculum validity at the
local level.






                                  Table II-5

                Analysis of Categories "WR+OR"* and "RW+RO"**
                                RANDOM SAMPLE

                           No.    Possible        Selected Pupil-Item Responses
                           of    Pupil-Item         "WR+OR"              "RW+RO"
Test                      Items   Responses     No.    Mean    %     No.    Mean    %

Word Meaning                 38      21,546     5053    8.9   23     1694    3.0    8
Paragraph Meaning            60      34,020     7998   14.1   24     3590    6.3   11
Arithmetic Computation       39      22,113     5630    9.9   25     1781    3.1    8
Arithmetic Concepts          32      18,144     4059    7.1   22     2162    3.8   12
Arithmetic Applications      33      18,711     4158    7.3   22     2270    4.0   12

 * Wrong or Omit fall, Right spring = possible gain
** Right fall, Wrong or Omit spring = possible fall guessing
For example, we do not know specifically the amount of time
assigned to arithmetic instruction, and we do not know to what
extent other variables - such as the textbook, the general
philosophy of the authors of these texts (traditional versus
"modern"), or the competency of the teachers themselves - enter
to determine the experimental results.



Some of these factors can not, and perhaps should not, be
controlled for all children tested, but at least conditioning
factors should be recognized.

Summary of Category Analysis

Each experimental evaluation of any Title I project (or similar
local evaluation), as contrasted to comparison with a national
norm, should be based upon a clear-cut statement of the
objectives to be learned within the grade - while at the same
time recognizing the fallacy of assuming that all children in
the grade are equally capable of learning.

Tables similar to the three involved here, representing the
performance of the random sample, are presented in Part III for
the Title I group, and notable differences in the performance
of the two groups will be evident at that time and can be
discussed on their merits.

As expected, the Title I group performance is lower, testwise,
but there are rays of hope in what appears to be improved
learning in relation to known learning potential.
SOME CHARACTERISTICS OF STANDARDIZED TESTS
RELEVANT TO THE RELATIONSHIP BETWEEN RIGHTS 
AND ATTEMPTS 

Perhaps this section should be initiated by pointing out that
the ideal relationship between rights and attempts, in the case
of a standardized test, is largely a matter of attitude:
attitude of the school administration, of the instructional
staff, and of the pupils.

First of all, the purpose of any in-school test is to find out
how much an individual knows about the body of information
assessed by the test. This applies regardless of whether the
test is a standardized test or is a local teacher-made test.
Standardized tests, however, are constructed in such a way that
certain factors are introduced which relate to, and affect the
relationship between, rights and attempts; specifically, the
almost universal use of some form of multiple choice test most
of the time.

The very careful analysis, in the case of achievement tests, of
the curriculum for the grade or grades in question prevents the
introduction of material that is not pertinent to the universe
of students to which the test is to be given. The test is often
broken down into batteries covering one or two or, very rarely,
three grades - each battery containing materials specifically
identified with the instruction in that (those) grade(s).

It is legitimate to cover two or three grades in some subject
tests, especially at the upper grades, because the curriculum
sources from which the materials are collected are not specific
enough to permit the assignment of a particular question or
item to a particular grade in every instance. In such tests,
the number of items should be greater than in other tests where
there is more agreement as to grade placement.

The relevant fact here is that nothing ever gets into the
preliminary experimental standardized test until it has been
justified by determining that it does, indeed, appear in the
appropriate curriculum material for that grade (or grades). Not
just one or two textbooks are analyzed, but a large number of
series are studied - together with courses of study and other
relevant curriculum materials, including yearbooks of national
societies and the like.

In fact, the experimental editions, from the point of view of
their comprehensiveness, may even be more curricularly valid
than the final editions of the tests, which are necessarily
curtailed somewhat - due in part to the performance of the
items when they are actually tried out in school situations,
but also due to limits of length relative to other tests in the
battery, time limits, and cost.

The aforementioned experimental editions for item tryout
purposes require the arrangement of items in judged order of
difficulty, so that the pupils taking the test do not find the
items in random sequence. This is also a point in favor of
professional practice.

Subsequent to the item analysis and the re-examination of the
items, those items or questions finally retained are arranged
in a more precise, data-based order of difficulty - so that,
ideally, except for the variations that exist from community to
community, a child will answer, first, a very easy item, next,
an item of somewhat greater difficulty, and so on, until he
reaches the very hard items at the end of the test.

It is also customary to conduct experiments to determine the
overlapping of scores of tests which are adjacent in a series.
If the test is a comprehensive one, both as to variety of
subject matter and range of grades covered, it is called a
battery.

In the case of the Stanford Achievement Test, in general, each
subject in each battery was administered to adjacent grades.

In the earlier days of testing (more than at the present,
perhaps) a further experiment was carried out to determine the
needed amount of time to answer the questions in each test - so
that a statement like the following is commonly made: "In light
of the fact that the items are arranged in order of difficulty,
the time limit is always long enough so that a given child can
answer correctly any items in the test which he is likely to
know."

It is never considered desirable, from a test-maker's point of
view, that the test score shall be enhanced by the effect of
chance - although it is believed by this writer at this time,
in terms of the data revealed by the present analysis, that
altogether too much of this is taking place, an intolerable
amount in point of fact.

The Rationale of Rights versus Attempts 

If the points raised in the previous paragraphs are true as
applied to a particular test, it seems quite evident that the
important thing to determine for a test is how much time an
individual needs to do all the items he is capable of doing. It
is a good thing, rather than otherwise, to stop him before he
has time to go on and guess on items of which he has no prior
knowledge.

However, good or not, differentiated timing for individuals is
something that is impossible to do - since the working time of
individual pupils will vary so much from test to test or area
to area. Giving unlimited time can disrupt a class because some
(one or two) students per class dilly-dally along or are unable
to complete a particular subtest other than by guessing, while
others can consume enormous amounts of time.

It therefore follows, by logic alone, that if an individual
answers Question #1 correctly, Question #2 correctly, Question
#3 correctly, etc., until he has reached the point where he no
longer knows the correct answer to most of the questions (and
thus finds that he is beyond his depth either by knowing or
reasoning) and then stops, the correlation (degree of
"togetherness" or correspondence) between rights and attempts
will be high.

Actually, how high the correlation will be will depend upon the
temperament of the individual pupils and their willingness to
recognize that they no longer are answering the questions on
the basis of knowledge but are guessing randomly.

One would estimate, therefore, that the correlation between
number right and number of attempted items in a valid test must
be substantial; i.e., in the order of .85 or .90.



The Correction for Guessing

At this point we must interrupt this sequence of discussion to
point out that for a period of years it was felt that a
correction for guessing, such as
rights-minus-some-fraction-of-the-wrongs, would counteract the
occasional incident in which an individual would guess wildly
instead of answering those items he knew and omitting the
rest. 1/
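A correction of this kind can be sketched as follows. The text says only "some fraction" of the wrongs. The fraction 1/(k - 1), for k answer choices per item, is the conventional choice and is assumed here, not taken from this report. The function name is hypothetical.

```python
def corrected_score(rights, wrongs, n_choices):
    """Rights minus a fraction of the wrongs.

    Assumes the conventional fraction 1 / (k - 1), where k is the
    number of answer choices per item. Omitted items do not enter.
    """
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    return rights - wrongs / (n_choices - 1)


# A pupil who marks all 38 four-choice items blindly can expect about
# 9.5 rights and 28.5 wrongs; the correction brings that back to zero.
print(corrected_score(9.5, 28.5, 4))

# A pupil with 20 rights and 9 wrongs on the same test.
print(corrected_score(20, 9, 4))
```

The correction leaves untouched the score of a pupil who omits what he does not know, and lowers the score of one who marks wrong answers.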



Although the correction for guessing was largely dropped,
generally nothing is said in the Directions to emphasize that
guessing is not advisable or, in fact, is specifically mandated
as being inadvisable. Certainly this was true of Stanford:
Int. I: Form X. This is a great error in tactics, as will be
seen as it is discussed later.



The Correlation of Rights versus Attempts
for the Stanford Achievement Test:
Intermediate I Battery: Form X: Grade 4

The author decided to determine, as the natural first step more
than for any other reason, what the correlation between rights
and attempts really was in this instance. He anticipated that
the expected rather high correlations would result.

In order to do this task, since computer time was not
immediately available, it was decided to use a population of
precisely 100 cases, drawn randomly by sex; i.e., 50 boys and
50 girls. (The rosters were so ordered.) This sample was drawn
and the correlations were worked out for the five tests with
which the report is intimately concerned.
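The kind of computation involved can be sketched as follows, assuming the ordinary product-moment correlation was the statistic used. The two small sets of scores are invented purely for illustration and are not drawn from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 cases")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical pupils who stop when the items get too hard:
# rights rise with attempts, so r is high.
stop_attempts = [12, 18, 25, 31, 38]
stop_rights = [11, 16, 22, 27, 34]

# Hypothetical pupils who mark nearly everything regardless of knowledge:
# attempts pile up near 38 while rights scatter widely.
mark_attempts = [36, 38, 38, 37, 38]
mark_rights = [9, 30, 14, 22, 11]

print(round(pearson_r(stop_attempts, stop_rights), 2))
print(round(pearson_r(mark_attempts, mark_rights), 2))
```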

The resulting pattern of correlations (Table II-6) seemed to
make no sense whatsoever. Even the highest of them fell far
below the standard expected levels, and some of them were low
enough as to make it not too unreasonable to ask if the
correlations were significantly different from zero!

Even correlation ratios, unaffected by lack of normality and
other population deviations, were computed without gaining any
significant insight. The obvious negative skewness was not
wholly overcome by the correlation ratios (there are two for
each scatterplot).

It was felt that there must be something wrong with the
sampling technique (although the writer could not discover any
error) and, therefore, arrangements were made to re-do this
part of the project by computer so as to involve the entire
population instead of a sample of only 100 cases.

This set of calculations was done separately for the two
populations with which this study is concerned; namely, the
random sample of the state as a whole tested fall and spring
and also the Title I children, similarly tested both fall and
spring.

Table II-7 gives the results of the random sample analysis. It
is perfectly evident that the second analysis strongly
corroborates the analysis done the first time with respect to
the low correlation values found.
There is a clear-cut difference in the r's for the last two
math tests (namely Concepts and Applications) as compared to
Word Meaning, Paragraph Meaning, and Arithmetic Computation.



1/ See Part I, pages 1 and 2.



                               Table II-6

            Attempts versus Rights - 100 Case Random Sample
             Correlations, Means, and Standard Deviations
                 RANDOM SAMPLE of 50 Girls and 50 Boys

                           Corr.          Mean            Standard Dev.
Test                         r     Attempts   Rights    Attempts   Rights

FALL:
Word Meaning                [?]     27.52     16.24       7.81     7.[?]
Paragraph Meaning           [?]     45.98     24.82      11.53     9.28
Arithmetic Computation      .32     28.37     11.57      9.[?]1    5.05
Arithmetic Concepts         .15     28.69     13.21       4.28     5.18
Arithmetic Applications     .18     28.28     12.50       5.93     5.34

SPRING:
Word Meaning                [?]     33.34     22.42       5.87     7.41
Paragraph Meaning           .17     53.78     33.10       6.86    10.69
Arithmetic Computation      .30     31.97     18.05       7.24     7.38
Arithmetic Concepts         .24     30.37     17.23       2.26     6.31
Arithmetic Applications     .18     30.98     16.79       2.33     6.32

Since the second set of correlations had been done without
paying any particular attention to the shape of the separate
score distributions, we went back to our data to examine this
parameter to see if we could find any causative factors that
would result in this peculiar set of results.

Bivariate distributions were available only for the sample of
100 cases, but a more thoughtful examination of this small
sample now revealed a potential piling up of cases at the top
of the distribution on the attempts variable.

This led to the distribution of attempts alone on a univariate
scale, the results of which are shown in Table II-8
(Distribution of Attempts) for the random sample.

Analysis of the Univariate Score Distribution for Skewness

On this table (II-8) the piling up became painfully evident -
with a very large but varying proportion of youngsters
attempting all of the items. This table, however, was not
revealing with respect to the number of those who attempted all
items but who, in turn, made high scores.

This led, then, to the separate distribution of the scores for
those children attempting all items. The amazement of this
writer was very great to discover that these reported scores
ranged almost as widely as the distribution of raw scores on
the test for the total group. See Table II-9 (Distribution of
Right Responses for the Attempted ALL Group). Remember: We are
now considering only the state random sample; the Title I group
will be discussed later.

There were some few individuals who attempted all the items
because they really were able to answer almost all of them
correctly. Thinking specifically of the vocabulary test (Word
Meaning), which had 38 items, earned scores of 35, 36 and 37
were found among the individuals who attempted all items in the
spring.

The distribution of right scores for those who attempted all
items revealed the obvious; i.e., much guessing had taken place
and this indeed had inflated the scores for many of these
individuals - although 15% earned scores which fell below the
chance level (9.5) on the Word Meaning Test in the fall.

                               Table II-7

            Attempts versus Rights - 567 Case Random Sample
             Correlations, Means, and Standard Deviations
                RANDOM SAMPLE of 282 Girls and 285 Boys

                           Corr.                Mean            Standard Dev.
Test                         r      k*    Attempts   Rights   Attempts   Rights

FALL:
Word Meaning                .55   (.84)    26.79     15.69      7.70      7.10
Paragraph Meaning           .49   (.87)    45.68     24.29     12.56      9.54
Arithmetic Computation      .29   (.96)    27.81     11.46      9.14      4.51
Arithmetic Concepts         .26   (.96)    29.26     12.91      4.75      5.20
Arithmetic Applications     .29   (.96)    29.13     12.77      6.19      5.12

SPRING:
Word Meaning                .58   (.81)    33.34     21.78      6.34      7.34
Paragraph Meaning           .36   (.93)    54.03     31.90      9.79     10.53
Arithmetic Computation      .35   (.94)    32.08     18.25      7.76      7.13
Arithmetic Concepts         .33   (.94)   [?].72     16.26      4.05      6.25
Arithmetic Applications     .25   (.97)    31.46     16.10      4.10      6.34

* Coefficient of Alienation
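The parenthesized values in Table II-7 are labeled Coefficient of Alienation. A minimal sketch follows, assuming the conventional definition k = sqrt(1 - r^2). The report does not state the formula, but this definition reproduces nearly all of the tabled values to two decimals.

```python
from math import sqrt


def coefficient_of_alienation(r):
    """Conventional coefficient of alienation, k = sqrt(1 - r**2).

    k approaches 1.0 as the correlation r approaches zero, i.e. as
    one variable tells us less and less about the other.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    return sqrt(1.0 - r * r)


# Correlations from Table II-7 (fall and spring, Word Meaning first).
for r in (0.55, 0.49, 0.29, 0.58, 0.36, 0.35, 0.33, 0.25):
    print(f"r = {r:.2f}   k = {coefficient_of_alienation(r):.2f}")
```

Even the highest correlation in the table, .58, leaves a coefficient of alienation above .80.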

In other words, even though the scores were so low that they
could have been reasonably gained by marking the answer sheet
without regard to the test booklet, these students marked all
items.
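The chance level of 9.5 quoted for Word Meaning can be sketched as the expected score from blind marking. The figure of four choices per item is an assumption, inferred from 38 items and a chance level of 9.5. The spread shown is the ordinary binomial standard deviation, which the report does not itself use.

```python
from math import sqrt


def chance_score(n_items, n_choices):
    """Expected number right when every item is marked blindly."""
    return n_items / n_choices


def chance_score_sd(n_items, n_choices):
    """Binomial standard deviation of the blind-marking score."""
    p = 1.0 / n_choices
    return sqrt(n_items * p * (1.0 - p))


# Word Meaning: 38 items, assumed four choices each.
print(chance_score(38, 4))
print(round(chance_score_sd(38, 4), 2))
```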



A Capsule Review of the Above 

What we have now determined is that what was considered to be
the normal pattern of rights versus attempts does not exist for
the random sample population. For those children who mark all
of the test questions, the range of scores is almost as wide as
the range for the total population - including the individuals
who did not attempt all items!

The inevitable conclusion that must be drawn is that guessing
was rampant in this population and that the general psychology
for taking the test was one of: (1) answering an item without
careful consideration of all alternatives if the answer was
known; (2) if it was not known, then either estimation or sheer
guessing was resorted to as a way of enhancing the individual's
score on the test.

Let it be made abundantly clear that the fact that an
individual marks every question on the test does not
necessarily guarantee that he is a guesser as compared to one
who attempts only items he reasonably thinks he can answer.
This means our "attempt all" distributions are affected by the
performance of the very able.

Let it be equally clear that if an individual marks all 38
items (on a test such as Word Meaning) and comes up with a
final score that is at or near the guessing level or not far
above it, the conclusion is equally inescapable that this
result can come about only by a very inordinate amount of
guessing.






Answer Sheet Study - II 



                                Table II-8

        Distribution of Attempts with Means and Standard Deviations
                    Random Sample of 567 Boys and Girls

                               N         Mean Attempts     Std. Dev.    % Attempting All
Test                        F     S        F      S        F      S        F      S

Word Meaning               565   564     27.8   33.9      8.5    6.4      22%    48%
Paragraph Meaning          564   567     46.0   54.1     12.8   10.1      23%    49%
Arithmetic Computation     566   564     27.0   33.1      9.2    7.7      23%    [?]
Arithmetic Concepts        566   562     29.1   30.9      4.8    3.0      [?]    [?]
Arithmetic Applications    563   563     29.1   31.5      6.0    3.4      [?]    [?]

F = fall, S = spring. The frequencies by number of attempts (60
down to 2-10) are not legible in this copy. An entry of 74% is
also legible near the end of the "% Attempting All" row, but
its column cannot be identified.
                                Table II-9

    Distribution of Right Responses for Students Who Attempted All Items
                    Random Sample of 567 Boys and Girls

                             Total N       Mean Rights         S.D.
Test                        F     S         F       S        F      S

Word Meaning               124   269      19.6    25.1      7.7    6.7
Paragraph Meaning          [?]   [?]      27.1    33.9      9.6   11.4
Arithmetic Computation     [?]   [?]    [?]3.4    19.3      4.7    7.6
Arithmetic Concepts        [?]   [?]      13.5    16.9      [?]    [?]
Arithmetic Applications    [?]   [?]      13.4     [?]      [?]    [?]

F = fall, S = spring. The frequencies by number of Rights and
the remaining summary entries are not legible in this copy.
THE GUESSING INDEX OR
THE RIGHTS/ATTEMPTS RATIO

Our studies to this point seemed to indicate the need for
further investigation of the significance of the Rights versus
Attempts information. Consequently, a new line of investigation
was started; namely, a study of the behavior of a ratio
comprised of the Rights divided by the Rights plus Wrongs, or
Rights/Attempts (R/A).

In order to do this as expeditiously as possible without
getting involved in machine analysis, the reverse side of some
blank IBM cards was used to make up a record card for each
pupil, a copy of which is shown in Figure II-5.



As shown, the card now contains the marginal information from
the rosters - which consisted of Rights, Wrongs, and Omits -
and from this we derived the number of Attempts by adding
Rights and Wrongs and computed the R/A ratio by dividing the
Rights by the Attempts.
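The derivation just described can be sketched as follows. The function name and the handling of a pupil with no attempts are illustrative assumptions; the report does not say how such a card was treated.

```python
def rights_attempts_ratio(rights, wrongs):
    """R/A ratio: Rights divided by Attempts, where Attempts = Rights + Wrongs.

    Omits do not enter the ratio. Returns None when nothing was
    attempted (an assumption made here for illustration).
    """
    attempts = rights + wrongs
    if attempts == 0:
        return None
    return rights / attempts


# A pupil with 30 Rights, 8 Wrongs and no Omits on a 38-item test.
print(round(rights_attempts_ratio(30, 8), 2))

# A pupil with 3 Rights, 3 Wrongs and 32 Omits: same formula, far fewer items.
print(rights_attempts_ratio(3, 3))
```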

This information was recorded on the cards, which were then
sorted in R/A order separately by test. The distributions were
transferred to Normal Percentile Charts and cumulative percents
were calculated and plotted.



(Card headed "Answer Sheet Analysis, SAT 1964 Ed., Form X,"
with spaces for Pupil No., Pop. (RS or Title I), Test, and Boy
or Girl.)

Figure II-5
Pupil Record Card



The plots were drawn from point to point to check on the shape
of the distributions. Since this was done separately for fall
and spring, each chart contained two distributions.

It turned out quite clearly that the amount of guessing
involved, indicated by the diminution in the size of the R/A
ratio as related to score, was more substantial in the fall
than it was in the spring. This meant more guessing in the fall
than in the spring.

Although one can rationalize as to the reason for this, this
writer knows of no statistical method to arrive at any final
explanation. Possibly the children, having a long summer
vacation and being faced with content mostly related to the
year ahead, were impelled to guess more in the fall in an
attempt to make a good record.

One pair of distributions, plotted on a Normal Percentile Chart
in the manner indicated, is reproduced as Chart II-1. Since all
of the charts essentially follow the same general pattern, the
others will not be reproduced.

It will be seen that the lines from the 10th to the 90th
percentile ranks are fairly straight. (A straight line means a
normal curve on these charts.) Any tendency to curvature in the
line appears above or below these points.

This was also true of the charts which are not reproduced, so
we can say essentially that the metric involved here, whatever
it is, is one which is fairly symmetrically distributed. In
other words, there is no skewness in the R/A ratios to reflect
the skewness which was discovered when we made distributions of
the "attempt" scores earlier.

The range of the R/A ratio is almost unbelievable. It goes from
practically .00 to 1.00, indicating that the amount of guessing
just has to be very, very substantial unless the logic which
preceded this phase of our study - namely, that Rights and
Attempts should not differ too greatly and that the
correlations between the two should be essentially high, as
they were not - was incorrect.

Chart II-1

Plot of the Index of Guessing
Showing Decrease in the Amount of Guessing in the Spring Compared to Fall
RANDOM SAMPLE
NORMAL PERCENTILE CHART

*Graphs are plotted on the step interval indicated at the top
of the chart.

Means and standard deviations were also computed for each of
the major groups, and these are shown in Table II-10.

It was clear from the above that the Rights/Attempts ratio had
certainly earned a place in our consideration of the data
involved in interpreting such test scores. Our first hope, that
this would prove an effective substitute for the typical
correction for guessing, proved to be a vain hope.

In the first place, high R/A ratios could be obtained by an
individual who attempted a very small number of items but
answered most of these correctly. This probably is a valid
indication of this individual's tendency not to guess,
especially considering the fact that the items are arranged in
order of difficulty.

However, other instances where only a few items were attempted
and only half, perhaps, of these were answered correctly
indicated that the ratio certainly was not comparable from one
part of the range to another, since a ratio of .50 based on 3
Rights and 6 Attempts is hardly a dependable statistic as
compared to one based upon substantially larger numbers of
Rights.
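The point about dependability can be illustrated with the standard error of a proportion. This is an approximation introduced here, not a computation from the report: it treats each attempted item as an independent trial with the same chance of success, which real test items only roughly satisfy.

```python
from math import sqrt


def ratio_standard_error(rights, attempts):
    """Approximate standard error of R/A, treating it as a binomial proportion."""
    if attempts <= 0:
        raise ValueError("at least one attempt is required")
    p = rights / attempts
    return sqrt(p * (1.0 - p) / attempts)


# The same ratio of .50, based on 6 attempts and on 60 attempts.
for rights, attempts in ((3, 6), (30, 60)):
    se = ratio_standard_error(rights, attempts)
    print(f"{rights}/{attempts}: R/A = {rights / attempts:.2f}, SE about {se:.3f}")
```

Under these assumptions the ratio based on 6 attempts is roughly three times as uncertain as the one based on 60.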



The ratio works best for the middle two-thirds of the
distribution of Rights, or approximately plus and minus one
standard deviation in the Rights or score distribution. The
middle three stanines (roughly 54% in a reasonably symmetrical
distribution) offer another way of selecting the place where
R/A is at its best.

It is worth taking note of, however, for anyone who made an
appreciable score above the average chance score, especially if
the number of Wrongs is large. It certainly does indicate a
temperamental tendency toward or away from random marking. 1/

We cannot leave this matter without considering what the
results obtained signify in terms of test-taking behavior. It
would seem obvious from the above that we must build into the
Directions for Administering a strict admonition not to respond
on a purely guessing basis, since this randomness in the score
distribution will only result in diluting the correlations
between the before-after scores.



1/ It will be further noted that the R/A ratio equals the item
difficulty when all items are attempted. A moment's thought
makes this perfectly logical. In other words, to obtain item
difficulties Rights are divided by all the total possible
scores, which in effect is what happens when there are no
omits.



                              Table II-10

          Means and Standard Deviations - Rights/Attempts
                            RANDOM SAMPLE

                                 Fall               Spring
Test                         Mean    S.D.        Mean    S.D.

Word Meaning                  .58    .213         .65    .181
Paragraph Meaning             .54    .174         .61    .181
Arithmetic Computation        .45    .132         .59    .207
Arithmetic Concepts           .45    .173         .53    .190
Arithmetic Applications      .4[?]   .178         .51    .195

(It may be one of the major reasons why we got such peculiar
correlations when we originally attempted to correlate Rights
versus Attempts, as reported earlier.)

This applies even if the tests administered are taken a full
year apart and are both different forms and different levels.
Random guessing always reduces the correlation between pairs of
scores.

Transforming the R/A Ratios Into Stanines

Although the distributions of scores show that these ratios
generally are symmetrical and more or less bell-shaped, they
certainly are not directly comparable to any of the other data
that we have.

Since we do have stanines for most of the other data - e.g.,
Rights, Wrongs 1/, etc. - that might be of value to the
teacher, we used the data from the cumulative percentages on
the Normal Percentile Charts to lay off stanine values for R/A
and read off the stanine ranges.

Univariate distributions of stanines, showing them also
graphically by means of asterisks, are shown in Figure II-6 for
the fall and in Figure II-7 for the spring.

We need only to call your attention to the fact that the
stanines did, indeed, fit the distributions remarkably well.
(Compare theoretical with obtained frequencies.) The resulting
stanine distributions are symmetrical, as of course they should
be for stanines - which are, after all, normalized standard
scores.
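Laying off stanines from cumulative percentages can be sketched as follows. The theoretical percentages 4-7-12-17-20-17-12-7-4 are the standard stanine definition. The treatment of a case falling exactly on a boundary is an assumption made here, and the function name is hypothetical.

```python
from bisect import bisect_left

# Theoretical percent of cases in stanines 1 through 9.
STANINE_PERCENTS = (4, 7, 12, 17, 20, 17, 12, 7, 4)

# Cumulative percent at the upper edge of stanines 1 through 8.
UPPER_EDGES = (4, 11, 23, 40, 60, 77, 89, 96)


def stanine(cumulative_percent):
    """Stanine for a case at the given cumulative percent (0 to 100).

    A case exactly on an edge is placed in the lower stanine;
    that convention is an assumption of this sketch.
    """
    if not 0 <= cumulative_percent <= 100:
        raise ValueError("cumulative percent must lie between 0 and 100")
    return bisect_left(UPPER_EDGES, cumulative_percent) + 1


for pct in (3, 15, 50, 80, 99):
    print(f"cumulative percent {pct:>2} -> stanine {stanine(pct)}")
```

The middle three stanines (4, 5 and 6) together hold 17 + 20 + 17 = 54 percent of the cases.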

If the teacher has occasion to profile pupil results, these
stanines are entirely appropriate for profiling purposes
against any other set of data expressed in stanine form and
based on essentially the same population.



1/ Available on request.

For correlation purposes, however, they are of less value -
since the R/A ratio has a built-in correlation with Rights and
Wrongs because the denominator of the ratio is the sum of these
two. Intercorrelations of subjects might work out well.



When Is a Test Invalidated By Guessing?

How high must the R/A score be to justify considering the test
invalid? The only real way to resolve this problem is to examine
the complete profile of the child, including a visual evaluation
of his scores as listed on the roster.

In instances where we see a rather substantial run of Rights at
the beginning of the test, we can more or less conclude that the
child knew those particular items. This pattern of Rights will
gradually break down as the items become harder, or in some cases
it will suddenly break down, and the child either goes into a full
guessing pattern or stops.

Perhaps we should solve this dilemma by considering "no test" any
instance where a child has a Rights/Attempts ratio of .50 or less.
A really satisfactory ratio should be .75 or higher, but apparently
both teachers and pupils need much more understanding before such
a high standard can be implemented.

(A high ratio means a large proportion of items Attempted were
answered correctly.)
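The two cut points suggested above can be written as a small rule. This is a sketch under stated assumptions: the .50 and .75 thresholds come from the text, while the labels "intermediate" and "no attempts" and both function names are hypothetical additions for illustration.

```python
def rights_attempts_ratio(rights, attempts):
    """R/A ratio; None when nothing was attempted."""
    if attempts == 0:
        return None
    return rights / attempts


def classify_ratio(ratio, no_test_at_or_below=0.50, satisfactory_from=0.75):
    """Label a ratio using the thresholds proposed in the text."""
    if ratio is None:
        return "no attempts"      # hypothetical label, not from the report
    if ratio <= no_test_at_or_below:
        return "no test"
    if ratio >= satisfactory_from:
        return "satisfactory"
    return "intermediate"         # hypothetical label for .51 through .74


example_label = classify_ratio(rights_attempts_ratio(20, 40))
```

A pupil with 20 Rights out of 40 Attempts has a ratio of .50 and falls in the "no test" category under this rule.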

Children who are obviously guessing should be excluded from the
item analysis, and the N reduced by 1 for every such case
eliminated in computing the difficulty values.
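A minimal sketch of that exclusion follows. It assumes each pupil's answers are already scored 1 (right) or 0 (wrong or omitted) per item; the data layout, the pupil identifiers, and the function name are hypothetical.

```python
def difficulty_values(item_scores, excluded=()):
    """Percent passing each item, with excluded pupils removed from N.

    item_scores maps a pupil identifier to a list of 0/1 item scores.
    """
    kept = [scores for pupil, scores in item_scores.items()
            if pupil not in excluded]
    n = len(kept)  # N is reduced by 1 for every excluded pupil
    if n == 0:
        raise ValueError("no pupils left after exclusion")
    n_items = len(kept[0])
    return [100.0 * sum(scores[i] for scores in kept) / n
            for i in range(n_items)]


# Four hypothetical pupils, three items; pupil "c" is judged to be guessing.
scores = {"a": [1, 1, 0], "b": [1, 0, 0], "c": [0, 1, 1], "d": [1, 1, 1]}
all_pupils = difficulty_values(scores)
without_guesser = difficulty_values(scores, excluded={"c"})
```

Removing the one pupil changes both the numerator and the denominator of every percent passing, so the difficulty values shift for all items, not only the ones that pupil answered.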

Any formula that can be devised to allow for guessing will work a
hardship for some individuals.

For example, we have advocated the general thesis that a test
should be relatively difficult at the beginning of the period of
instruction in order to allow plenty of room for the individual to
indicate a real gain during the period of time he is subsequently
under instruction, provided (1) that he is subsequently exposed to
instruction, and (2) that he is believed to have reached a level
of mental development to permit learning the content in question.

                               Frequency         Percent
Stanine   R/A Range          Act.   Theor.     Act.   Theor.

WORD MEANING:
   9      .94-1.00            22      23         4       4
   8      .86-.93             36      37         6
   7      .75-.85             75      68        13
   6      .67-.74             94      99        17
   5      .54-.66            108     112        19      20
   4      .39-.53             97      99        17      17
   3      .29-.38             72      68        13      12
   2      .22-.28             35      37         6       7
   1      .08-.21             26      23         5       4
   Median = .60              555

PARAGRAPH MEANING:
   9      .85-.96             23      23         4       4
   8      .77-.84             36      37         6       7
   7      .70-.76             74      68        13      12
   6      .62-.69             93      99        16      17
   5      .50-.61            109     111        19      20
   4      .40-.49            104      99        18      17
   3      .31-.39             65      68        12      12
   2      .25-.30             37      37         7       7
   1      .07-.24             23      23                 4
   Median = .54              555

ARITHMETIC COMPUTATION:
   9      .81-1.00            24      23         4       4
   8      .72-.80             37      37         7       7
   7      .60-.71             68      68        12      12
   6      .48-.59             98      99        17      17
   5      .37-.47            113     112        20      20
   4      .28-.36            105      99        18      17
   3      .23-.27             62      68        11      12
   2      .16-.22             37      37         7       7
   1      .06-.15             22      23         4       4
   Median = .42              555

ARITHMETIC CONCEPTS:
   9      .76-1.00            22      23
   8      .68-.75             38      37         7       7
   7      .59-.67             69      68        12      12
   6      .49-.58            100      99                17
   5      .39-.48            10?     112        19      20
   4      .31-.38            100      99        18      17
   3      .23-.30             66      68        11      12
   2      .19-.22             43      37         6       7
   1      .06-.18             21      23         4       4
   Median = .42              555

ARITHMETIC APPLICATIONS:
   9      .77-1.00            25      23         4
   8      .68-.76             34      37         6
   7      .60-.67             71      68        13      12
   6      .49-.59            101      98        18      17
   5      .39-.48            109     111        19      20
   4      .30-.38             96      98        17      17
   3      .24-.29             69      68        12      12
   2      .18-.23             36      37         6       7
   1      .06-.17             22      23         4
   Median = .44

(In the original figure each actual frequency is also plotted as a
row of asterisks. Each * represents 4 students.)

Figure II-6
Stanine Distributions - Index of Guessing or Rights/Attempts
RANDOM SAMPLE - Fall 1969


                               Frequency         Percent
Stanine   R/A Range          Act.   Theor.     Act.   Theor.

WORD MEANING:
   9      .92-1.00            23      23         4       4
   8      .87-.91                     37         7       7
   7      .82-.86             62      68        11      12
   6      .72-.81            102      99        18      17
   5      .63-.71            121     112        21      20
   4      .53-.62             94      99        17      17
   3      .40-.52             60      68        11      12
   2      .32-.39             40      37         7       7
   1        ?-.31             23      23         4       4
   Median = .67              555

PARAGRAPH MEANING:
   9      .88-.98             24      23         4       4
   8      .83-.87             31      37         5       7
   7      .77-.82             72      69        13      12
   6      .68-.76            101      99        18      17
   5      .59-.67            111     112        20      20
   4      .4?-.58             97      99        17      17
   3                          74      69        13      12
   2                          37      37         6       7
   1                          20      23         4       4
   Median = .?2              557

ARITHMETIC COMPUTATION:
   9      .91-.97             23      23         4       4
   8      .86-.90             31      37         5       7
   7      .77-.85             73      68        13      12
   6      .68-.76             94      99        17      17
   5      .54-.67            124     111        22      20
   4      .41-.53             93      99        16      17
   3      .30-.40             67      68        12      12
   2      .23-.29             36      37         6       7
   1      .10-.22             23      23         4       4
   Median = .60              5??

ARITHMETIC CONCEPTS:
   9      .85-.94             23      23         4       4
   8      .78-.84             42      37         7       7
   7      .69-.77             65      68        12      12
   6      .58-.68             97      93        17      17
   5      .48-.57            105     111        19      20
   4      .38-.47            103      98        18      17
   3      .29-.37             60      68        11      12
   2      .21-.28             49      37         9       7
   1      .13-.20             18      23         3       4
   Median = .52              551

ARITHMETIC APPLICATIONS:
   9      .83-.97             24      23         4       4
   8      .77-.82             34      37         6       7
   7      .68-.76             62      68        11      12
   6      .59-.67             91      98        16      17
   5      .46-.58            122     111        22      20
   4      .34-.45            100      98        18      17
   3      .25-.33             70      68        12      12
   2      .19-.24             36      37         6
   1      .12-.18             24      23         4
   Median = .52              ?53

(In the original figure each actual frequency is also plotted as a
row of asterisks. Each * represents 4 students.)

Figure II-7
Stanine Distributions - Index of Guessing or Rights/Attempts
RANDOM SAMPLE - Spring 1970
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USE OF PREDICTED SCORE TO DETERMINE 
EXTENT OF GUESSING 

Some years ago (1950 approximately), this writer devised a
technique for estimating an untimed score from a time-limited
score - after observing that some children, particularly the slow
learners, were handicapped because the time limits (normally quite
satisfactory) were, for them, unduly short.

This technique is expressed in the formula:

   Untimed Score = A + (B/C x D), where

   A = the score earned within the stated time limits to the
       beginning of the series of omitted (hard) items;
   B = the score earned on the last twenty items attempted;
   C = the sum of the percent passing the last twenty items
       attempted; and
   D = the sum of the percent passing the remaining (i.e., not
       attempted) items.

A closer look at this formula makes it obvious that the value "C"
is actually the mean score earned by the population in question on
a subset of twenty items IF THESE ALWAYS ARE THE SAME TWENTY
ITEMS - since the sum of the difficulty values of any group of
items taken by a defined population, with the decimal point
retained for each percent, is the mean score for that population.
Similarly, "D" is the mean score of the items not attempted.
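The identity behind "C" and "D" (the sum of the difficulty values of a set of items equals the population's mean score on those items) can be checked on a small example. The data and function names below are hypothetical.

```python
def mean_score(item_scores):
    """Mean number right across pupils; item_scores is a list of 0/1 rows."""
    return sum(sum(row) for row in item_scores) / len(item_scores)


def sum_of_difficulty_values(item_scores):
    """Sum over items of the proportion of pupils passing each item."""
    n = len(item_scores)
    n_items = len(item_scores[0])
    return sum(sum(row[i] for row in item_scores) / n for i in range(n_items))


# Four hypothetical pupils, three items.
demo = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
```

Here the proportions passing are .75, .50, and .25, which sum to 1.5, the same as the mean of the four scores 2, 1, 3, and 0.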



The precise effect of the application of this formula, under the
condition stated, is to estimate a score for the items not
attempted by saying that it would be some parameter of the mean
score for the selected twenty items, dependent upon the "goodness"
of the performance of the individual on the chosen subset of
twenty items - as indicated by the ratio B/C.
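As a worked illustration of the formula Untimed Score = A + (B/C x D), the sketch below uses invented values for A, B, C, and D; none of them come from the study, and the function name is hypothetical.

```python
def untimed_score(a, b, c, d):
    """Estimate an untimed score as A + (B/C x D).

    a: score earned up to the start of the omitted items
    b: pupil's score on the last twenty items attempted
    c: population mean score on those same twenty items
    d: population mean score on the items not attempted
    """
    if c <= 0:
        raise ValueError("C must be positive")
    return a + (b / c) * d


# Hypothetical pupil: 30 right in time, 15 right on the last twenty items
# attempted, where the group averages 12.0; the group averages 6.0 on the
# items this pupil never reached.
estimate = untimed_score(30, 15, 12.0, 6.0)
```

Because this hypothetical pupil did better than the group on the last twenty items attempted (B/C = 1.25), the credit for the unreached items is 1.25 times the group mean on them, giving 37.5. A pupil exactly at the group mean (B = C) would simply be credited with D.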

Note: The subset of items used in the original application of the
formula was the last twenty items attempted by each individual.
Just what subset of items was used for Student A or B or C was
irrelevant, provided that it gave a good estimate of the ability of
the individual to answer the questions contained within the final
omitted items. All twenty items were supposed to be "attempted" in
the sense that the individual had carefully considered each. Omits
in small numbers were permitted. If the "DK" (Don't Know) space was
available, this could be used to a reasonable extent. The score
earned, except for the minimized effects of chance, was probably
the optimum estimate of the quality of work the individual was
capable of doing on the test, allowing for unreliability and the
failure of the individual to attempt all items.



While this formula had the virtue of correcting the individual's
score so that it gave a reasonably close estimate of what he was
capable of doing, it would underestimate in most instances the
score earned by the very ablest individuals. This was true because
the difficulty values for "D" were easy for the ablest students but
difficult for the average or low achievers. This was not a serious
limitation, however, because these very able individuals almost
always did all they were capable of doing in the time allowed and
time, thus, was itself not a factor. 1/

In this present situation, the analysis of the performance of the
individuals comprising the random sample from the total New
Hampshire state population at the fourth grade level indicated that
a very much larger proportion of individuals answered more of the
items correctly than one would anticipate if guessing were not
present as a common practice.

To put this differently, one would anticipate that both the rights
score and the attempts score would be more or less normally
distributed if the items were answered on the basis of true
information or knowledge or on the basis of a rational analysis of
the alternatives - with the final choice being made on the basis of
some knowledge, if not total knowledge. The rights score and the
attempts score would correlate highly. It has been shown that this
was not the case for the New Hampshire random sample on any of the
five tests analyzed.

In search of some additional light on this subject following the
analysis of the distributions of scores for the "attempt all"
population, the writer has adapted the formula described above for
the estimation of a total score on a test - using as the basis for
the estimate the first twenty items in each test (in Paragraph
Meaning, 23), which constitute the easy items.

It is felt that guessing tendency would be minimized in answering
this subset of items, because a much larger number of individuals
would know the answers to a very substantial proportion of the
items selected and would, therefore, not be likely to resort to
guessing.



1/ While this procedure has not been published, it was first
   described in 1949-50 and a nomograph, plus cumulative sums of
   percents passing, facilitated the free choice of any subset of
   twenty items as indicated.
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Note: Items were arranged in order of 
difficulty on the regularly published edi- 
tion of Stanford: Intermediate I: Form X. 
If the difficulty values of these easy, or at least easier, items
are used constantly as the basis for the estimate of total score,
it is possible to arrive at an estimate for individuals of varying
levels of ability and compare this with the score they actually
earned.

In this procedure, the values in "A" and "B" are always based upon
the first twenty items (23 for Paragraph Meaning so as to include
all questions on the last paragraph in which the twentieth item
occurred).

Thus, the original formula is modified to:

   Predicted Score = A + (A'/C x D), where

   A  = the score earned on the first twenty (or 23) items;
   A' = the same value;
   C  = the mean score of this population on the first twenty
        (or 23) items;
   D  = the mean score of this population on the remaining items -
        e.g., the last 18 items of Word Meaning.
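The modified formula can be sketched as follows. The values chosen for C and D are invented for illustration only (the report does not give them at this point), and the function name is hypothetical.

```python
def predicted_score(a, c, d):
    """Predicted Score = A + (A'/C x D), with A' equal to A.

    a: pupil's score on the first twenty (or 23) items
    c: population mean score on those same items
    d: population mean score on the remaining items
    """
    if c <= 0:
        raise ValueError("C must be positive")
    return a + (a / c) * d


# Hypothetical group means: 14.0 on the first twenty items and 7.0 on the
# remaining eighteen (a 38-item test, as in the Word Meaning example).
C_MEAN, D_MEAN = 14.0, 7.0
at_group_mean = predicted_score(14, C_MEAN, D_MEAN)  # pupil at the mean
at_ceiling = predicted_score(20, C_MEAN, D_MEAN)     # all twenty right
```

A pupil who scores exactly the group mean on the first twenty items is predicted the group's mean total (21.0 under these assumed values). A pupil with all twenty right is predicted about 30 of 38, however many of the later items he can actually answer, so the prediction has a ceiling below the maximum possible score.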

In making the decision to use the first twenty (or 23) items as a
constant in this study, two factors were involved. First, guessing
certainly would be minimized by using the very easiest items;
secondly, the difficulty range of such items must yield enough
variation of score to be reasonably reliable.

This estimation process, when done by hand, proved to be a
time-consuming task. In this report for the random sample we will
give the results for three tests only; namely, Word Meaning,
Paragraph Meaning, and Arithmetic Computation. 1/ Word Meaning and
Paragraph Meaning together constitute Reading, the subject of
greatest concern to Title I programs.

As earlier indicated, the special reason for concentrating on these
subtests was the obvious difference in the relative environmental
impact. Word Meaning and Paragraph Meaning are greatly affected by
the total environment; Arithmetic Computation is almost exclusively
school oriented.

It is well known that generally, in standardized tests, great care
is taken in



1/ Work was completed on the remaining tests after this manuscript
   was completed, and the results are given in Appendix C.
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the final forms of a test to arrange the 
items in order of difficulty.

There are various reasons for this, some statistical and some
psychological. It would be psychologically unwise, for example, to
begin a test with an extremely difficult item; this would
immediately discourage the child in taking the rest of the test.
Even a chance arrangement of items with respect to difficulty
would have much the same effect.

On the other hand, by arranging the items in order of difficulty
(easiest first) every child would be encouraged to do what he is
capable of doing. This arrangement has the additional advantage
that it makes the time limit of much less importance, since in
almost every instance a child will get all of the items right that
he can honestly answer correctly in the time allowed, even if he
does not attempt all of the items.

This has been shown repeatedly in experiments to determine the
effective working time limits, which are so necessary for the
practical administration of a test.

It is important to point out also that it is standard operating
procedure to build into a standardized test a wide enough range of
difficulty to provide for both the least and the most able pupils
within the group to be tested, obviously a necessity in a survey
test.

Usually this is accomplished at the lower end of the scale by
including some items that should have been learned at a grade
previous to that at which the test is normally given.

Ideally, in any local before-after program all the items should be
validated against the local curriculum; but this is rarely done,
unfortunately.

The upper levels are provided for by making items of greater
difficulty while still staying within the curriculum normally found
within the grade or grades for which the test is intended; i.e., to
avoid as much as possible including any item to which the target
group has not been exposed in instruction.

Consider the difference in the difficulty of adding two three-place
numbers and ten three-place numbers. In the first instance, the
opportunities for error are fairly limited; while in the second,
the opportunities for making errors are greatly increased because
of the number of times in which an individual must perform the basic
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operations involved in solving a problem of 
this nature. 

Both types, however, involve exposure to the basic problem of
complying with locally made "behavioral objectives" - which, far
too often, are rather inadequately conceived. 1/

The analysis that follows has divided the total random sample into
subpopulations of boys and girls, and each of these into those who
Attempted-All items versus those who Did-NOT-Attempt-All items.

The selection of the Attempted-All group as a way of designating
the guessing group is an impure or contaminated way of identifying
guessers, because some who Attempted-All items did not guess in a
random fashion at all, or did so on the very few items at the end
of the test.

These latter are the very able children, who naturally earn high or
near perfect scores because they know the answers or can arrive at
them rationally by thinking about the alternatives offered and
choosing one of two alternatives.

The technique of predicting a score and comparing the predicted
score with the actually earned score works best to identify the
guessers where the "attempted" count is substantially higher than
the earned score. The greatest difference, obviously, is to be
found where all items have been attempted and few are right. (This
has been previously considered in discussing the R/A ratio.)

The correlations reported in this study have been arrived at by
actually plotting the data on bivariate charts. Computerized
calculation would have been much faster and possibly more accurate.

Correlation coefficients alone can be very misleading, especially
the coefficient without the corresponding plot. For example, a
correlation coefficient does not as a general rule reflect gain in
score but simply expresses rank order.
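A small numerical check makes the point concrete: adding the same gain to every pupil's score leaves the correlation unchanged. The scores below are invented, and `pearson_r` is a plain textbook implementation, not the computing procedure used in the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


fall = [12, 15, 9, 20, 17]            # hypothetical fall scores
spring = [14, 18, 11, 19, 21]         # hypothetical spring scores
spring_plus_6 = [s + 6 for s in spring]  # same pupils, six more points each

r_original = pearson_r(fall, spring)
r_with_gain = pearson_r(fall, spring_plus_6)
```

The two coefficients agree even though every pupil in the second comparison gained six additional points, which is why means and standard deviations have to be examined alongside r.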



1/ This may be conceived as a criticism of the local "curriculum
   objectives committee" (by whatever name), but it is not so
   intended. Realistic "behavioral objectives" are time-consuming
   and difficult to prepare, especially if one keeps in mind the
   subsequent need to evaluate success or failure. Nebulous objects
   defy evaluation!



Simplified computational formulas were chosen to make it possible
to obtain the correlations manually from the plotted charts with a
minimum of work and to check the r's by the use of several
different formulas, all derived from sums and differences of
scores.
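The report does not state which formulas were used. One standard identity of this kind, offered here only as an assumed example, obtains r from the variances of the sums and of the differences of paired scores: r = (Var(X+Y) - Var(X-Y)) / (4 x S.D.(X) x S.D.(Y)). The sketch below checks it against the usual covariance form on invented data.

```python
from math import sqrt


def pvariance(values):
    """Population variance (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def r_from_sums_and_differences(x, y):
    """r from the variances of (x + y) and (x - y); an assumed example formula."""
    sums = [a + b for a, b in zip(x, y)]
    diffs = [a - b for a, b in zip(x, y)]
    return (pvariance(sums) - pvariance(diffs)) / (
        4 * sqrt(pvariance(x) * pvariance(y)))


def r_direct(x, y):
    """r from the covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return cov / sqrt(pvariance(x) * pvariance(y))


actual = [12, 15, 9, 20, 17]      # hypothetical actual scores
predicted = [14, 18, 11, 19, 21]  # hypothetical predicted scores
```

The identity holds because Var(X+Y) and Var(X-Y) differ by exactly four times the covariance, so agreement between the two routes is a check on the arithmetic rather than new information.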

This computational process also yielded means and standard
deviations, which are helpful in studying changes in magnitude and
variability of scores.

The correlations reported are listed in Table II-11, so that one
can see at a glance the rather substantial number of populations
separately studied and the general trend in the r's for different
subsamples.

The number of cases, the means, and the standard deviations are
reported separately in Table II-12, immediately following.

No attempt is made to evaluate statistically the differences
between the means of the Attempted-All group and the
Did-NOT-Attempt-All group because we are not dealing with purely
random samples and we had no reason to anticipate, without
investigation, that the distributions on which the correlations
were based even were normal or similarly skewed, so as to provide a
rectilinear plot.

Considering first the correlations alone (Table II-11), it is
noteworthy that they are high in practically every comparison; as a
matter of fact they are surprisingly high, all things considered.

One might even conclude that the first twenty items on a test give
about as good a measure as the total test, at least for the tests
considered. 2/ This would not, of course, be true - because such a
procedure does not take account of the range of performing ability
on the whole test for the group from which the New Hampshire item
difficulties were derived; i.e., the random sample of pupils tested
at grade 4 in 1969.

If we were to consider the distribution of scores for the first
twenty items only, we'd find many of the ablest children piling up
at the top score of 20, and our predicted score would be (and is)
too low. The predicting formula helps, but it still fails to do
justice to the very ablest children - whose predicted scores
regularly fall below their earned scores.

2/ In some cases they approach or exceed reported reliability
   coefficients.
Table II-11
Correlation Coefficients
Actual Scores versus Predicted Scores
RANDOM SAMPLE

                          Attempted-All    Did-NOT-Att.-All    Total Group
                          Fall   Spring     Fall   Spring      Fall   Spring

Word Meaning
   Boys                    .94     .92       .93     .89        .92     .89
   Girls                   .95     .88       .92     .90        .90     .89

Paragraph Meaning
   Boys                    .90     .90       .81     .78        .79     .82
   Girls                   .90     .91       .80     .73        .77     .81

Arithmetic Computation
   Boys                    .91     .92       .92     .89        .89     .87
   Girls                   .89     .89       .92     .89        .89     .88

Next it should be noted that the correlations for the Attempted-All
and Did-NOT-Attempt-All groups combined (i.e., the Total Group)
tend to drop slightly, but only a point or two in the hundredths
place. Most readers would regard such small differences as being
practically insignificant.

Absolute Changes in Score Over a Given Time

In order to establish the amount by which individuals change their
status by gaining additional points of score over the intervening
seven months, one must look at the data in the table of means and
standard deviations (Table II-12), given separately to avoid
clouding the issue of the level of agreement of actual versus
predicted scores. These are very important data, however, and need
careful study.

In Table II-12 we have summarized a very large amount of data in
what would be called a general purpose table; that is, one that
presents far more data than can be efficiently discussed in detail
in the text. Thus it presents the reader with a challenge to search
the table for meanings not specifically brought out in the
discussion.

Most Significant Elements in Table II-12

The table contains data relevant to the actual or recorded score
(that is, number right) - first, for those children who
Attempted-All items; secondly, for those who Did-NOT-Attempt-All;
and finally, for Total Group. These data are given separately for
fall and spring as well as for boys and girls, together with
numbers of cases in each subgroup.

We have then added to this table comparable data for the predicted
scores as for actual or reported scores.

We have given, finally, the differences between means for both the
actual raw score earned (i.e., the number right as scored by the
machine) and the predicted score separately for the Attempted-All
group versus those who Did-NOT-Attempt-All and the Attempted-All
group versus the Total Group of boys or girls.
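The difference-of-means columns are simple subtractions, Attempted-All mean minus the comparison-group mean. The sketch below recomputes three of them from means reported in Table II-12; the row labels and the helper name are illustrative only.

```python
def difference_of_means(attempted_all_mean, comparison_mean):
    """Attempted-All mean minus comparison-group mean, to one decimal."""
    return round(attempted_all_mean - comparison_mean, 1)


# (label, Attempted-All mean, Did-NOT-Attempt-All mean), from Table II-12.
rows = [
    ("Word Meaning, boys, fall, actual", 19.9, 14.6),
    ("Word Meaning, boys, fall, predicted", 16.8, 15.5),
    ("Arithmetic Computation, girls, fall, predicted", 10.9, 12.2),
]

differences = {label: difference_of_means(att, other)
               for label, att, other in rows}
```

A positive difference favors the Attempted-All group; a negative one, as in the Arithmetic Computation predicted score, favors the group that did not attempt all items.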

(Note: In the latter case, the Total Group data includes both of
the previous subsamples; thus in a sense this column is diluted by
the inclusion of the Attempted-All group. In point of fact this
final comparison does, however, indicate the effect of the
inclusion of the Attempted-All subgroup data on the total results
as previously reported to the community.

Thus it highlights the fact that these children do constitute a
separate subpopulation, distinct in character from the
Table II-12

Comparison of Attempted-All versus Did-NOT-Attempt-All Groups for Selected Statistics
With Particular Emphasis on Magnitude and Direction of Differences Between Means

RANDOM SAMPLE *

                        Attempted-All     Did-NOT-Att.-All   Diff. of    Total Group      Diff. of
                         N   Mean  S.D.     N   Mean  S.D.   Means 1/     N   Mean  S.D.  Means 2/

Word Meaning
 BOYS
  Fall - Actual               19.9   8.2   222  14.6   7.1     +5.3      284  15.8   7.7    +4.1
  Fall - Predicted            16.8   6.8        15.5   6.6     +1.3           15.8   6.7    +1.3
  Spring - Actual       136          7.2   148  18.5   6.5     +6.7      284  21.7   7.6    +3.5
  Spring - Predicted          23.2   5.9        20.3   5.9     +2.9           21.7   6.1    +1.5
 GIRLS
  Fall - Actual               19.4   7.2   219  15.?   6.1     +4.4      281  16.0   6.6    +3.4
  Fall - Predicted            16.5   6.2        15.?   6.0     + .6           16.1   6.0    + .4
  Spring - Actual       133   25.3   6.1   147  19.?   6.3     +6.1      280  22.1   6.9    +3.2
  Spring - Predicted          23.7   4.7        20.?   5.6     +3.2

Paragraph Meaning
 BOYS
  Fall - Actual          66   27.5  10.4   217  22.4   9.3     +5.1
  Fall - Predicted            22.3   8.5        24.0   8.2     -1.7
  Spring - Actual       138   33.3  11.8   147  29.4   9.5     +3.9
  Spring - Predicted          31.0  10.2        32.5   8.0     -1.5
 GIRLS
  Fall - Actual          66   27.1   8.6   215  24.4   9.2     +2.7      281  25.1   9.1
  Fall - Predicted            21.9   7.5        25.9   6.9     -4.0           24.9   7.2
  Spring - Actual       141   34.6  10.7   141  30.5   8.8     +4.1           32.6  10.0
  Spring - Predicted          32.2   8.9        32.7   7.2     - .5           32.4   8.1

Arithmetic Computation
 BOYS
  Fall - Actual          72   12.4   4.9   213  10.8   4.1     +1.6
  Fall - Predicted            10.6   4.9        11.3   4.8     - .7
  Spring - Actual       113   18.7   7.8   170  16.9   6.2     +1.8
  Spring - Predicted          16.2   6.5        17.9   5.7     -1.7
 GIRLS
  Fall - Actual          59   12.7   4.6   221  11.5   4.6     +1.2      280
  Fall - Predicted            10.9   4.7        12.2   4.5     -1.3
  Spring - Actual        94   20.1   6.9   187  18.6   6.9     +1.5      281  19.1   6.9    +1.0
  Spring - Predicted          18.9   5.2        19.3   5.7     - .4           19.1   5.5    - .2

*  Bivariate Overlays showing displacement of "Attempted-All" group versus
   "Did-NOT-Attempt-All" group shown in Appendix for Word Meaning and Arithmetic Comp.

1/ "Attempted-All" group versus "Did-NOT-Attempt-All" group

2/ "Attempted-All" group versus "Total Group"
rest, which unwittingly has affected the performance of the total
group because of the very significant difference as to their
method of marking the answer sheet.)



At this point, since we have considered the gains from fall to
spring for the total random sample group elsewhere, let us
concentrate mainly on the data for the Attempted-All versus the
Did-NOT-Attempt-All groups.

(Only three of the five tests generally analyzed in this report are
given in Table II-12. Arithmetic Concepts and Arithmetic
Applications have been examined closely enough to see that their
results are consistent with the others.) 1/



First notice that for all subgroups on all tests the differences
between means of actual scores favor the Attempted-All group. This
is true even when we consider the Attempted-All group in comparison
with the Total Group, of which they are a part. In Word Meaning,
this is true of predicted scores as well.

When we move on to Paragraph Meaning and Arithmetic Computation, we
see negative differences between means for all predicted scores;
that is, predicted scores are higher for the Did-NOT-Attempt-All
group and the Total Group than for the Attempted-All group. Out of
sixteen comparisons, all are negative.

Note that in the table the negative differences always apply to the
predicted score, not the actual score; i.e., the number right. The
significance of this is that early performance predicts lower total
scores for the Attempted-All group than for those who
Did-NOT-Attempt-All or the Total Group.

The question remains, however, (and must remain unanswered in this
report) as to which of the sets of scores, actual versus predicted,
is the more valid measure of group or individual performance.

The writer's guess is that the predicted score truly represents the
performance of a child more adequately than actual score when his
earned score (i.e., the number of items answered correctly as
scored) is on

1/ As time permits, Arithmetic Concepts and Applications will be
   completed; but the three tests shown were enough to demonstrate
   the essential fact with respect to the uniqueness of the
   Attempted-All subgroup. Costs and time led to the decision to
   omit the remaining two tests for the moment from the random
   sample analysis.



the low side. High scores (i.e. ,,70% to 757* 
right) are an exception .almost by definition 
on a standardized survey-type test. Other- 
wise, the ablest children would not be mea- 
sured! 


Reproduction of the bivariate charts for this report presented a very
difficult problem. Separate Attempted-All versus Did-NOT-Attempt-All
bivariates illustrate clearly the effect of guessing, but it was hard
to compare two charts.

Sample Bivariate Charts 

Actually, each bivariate chart as shown in Appendix B consists of two
bivariates combined; one for those who Attempted-All of the items
superimposed on the chart for those who Did-NOT-Attempt-All items.

The Attempted-All group is printed in a contrasting color so that one
can see the change in the distribution from group to group, which
always is in the direction of higher overall performance for those
who Attempted-All items and, therefore, took advantage of every
opportunity to guess. It is this spurious gain due to guessing which
must be identified and eliminated to make the test truly valid. 2/

Perhaps these bivariate distributions, from a layman's point of view,
are the most significant or convincing evidence of the presence and
effect of guessing.

It would have been very desirable to reproduce in the report the
entire 96 bivariates from which the correlations were computed. This
consists of 36 such charts for the random sample group, with which we
are presently concerned, and 60 for Title I, which will be the
concern of the next section. However, this was impractical from a
space point of view and, therefore, only a selection of these have
been reproduced.



The consistent offsetting of the Attempted-All subpopulation is
conspicuous on these charts. Each bivariate group (i.e., black versus
colored) taken by itself yields a correlation most of the time higher
than for all cases combined. The raw score means of the Attempted-All
group are higher than the raw score means of the Total Group.



The greatest spurious gains are for those who EARN low scores; thus,
guessing hurts most those children who are in greatest need of help!



2/ Bivariate charts are shown separately for 
boys and girls and for two subjects only, 
Word Meaning and Arithmetic Computation. 
We must conclude, therefore, from these data taken as a whole, that
guessing, certainly in the sense of marking every item in the test
regardless of whether you know the answer or not, generally does have
the effect of raising one's apparent score and, therefore, getting a
higher percentile rank or grade equivalent.

Therefore guessing misleads the teacher as to what the individual
really knows. When the score data are reduced to item analysis
information, as they have been in this study, such contamination has
a very significantly detrimental effect.

As we continue our inspection of the bivariate charts, we must note
that not all of the changes found from the actually earned score to
the predicted score are in the direction of an increase in predicted
score. This is in support of what we said earlier - that this is an
impure or contaminated way of identifying guessing youngsters.

One of the factors responsible for a drop in the predicted score as
compared to the actual, earned score is that some of the items
remaining after the first twenty items were scored were too hard for
all but the most able children in the population. Thus the very able
pupils actually did earn scores higher than their predicted scores
(and some earned a nearly perfect score).

Most significant, however, is the fact that the intercorrelation
between purely guessed scores is zero and will vary from this value
only by chance. 1/ Partial guessing either in fall or spring reduces
all correlations in a manner proportionate to the amount of guessing.

1/ Not demonstrated here. See page II-5.
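
The statement about purely guessed scores can be illustrated with a
small simulation. What follows is only a sketch under stated
assumptions, not a procedure used in this study: every pupil is
assumed to mark all 33 five-choice items at random in both fall and
spring, and the function names, the number of simulated pupils, and
the random seed are all hypothetical.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from first principles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def guessed_score(n_items, n_choices, rng):
    """Score of a pupil who marks every item purely at random.

    Alternative 0 is treated as the keyed response for every item;
    under random marking the choice of key does not matter.
    """
    return sum(1 for _ in range(n_items) if rng.randrange(n_choices) == 0)


def simulate_guess_correlation(n_pupils=2000, n_items=33, n_choices=5, seed=0):
    """Fall-spring correlation when both scores are pure guesses (assumed model)."""
    rng = random.Random(seed)
    fall = [guessed_score(n_items, n_choices, rng) for _ in range(n_pupils)]
    spring = [guessed_score(n_items, n_choices, rng) for _ in range(n_pupils)]
    return pearson_r(fall, spring)
```

Under these assumptions the simulated coefficient should lie close to
zero, departing from it only by sampling fluctuation.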
EVALUATING PUPIL PERFORMANCE FROM THE PUPIL ROSTERS

The basic data for this study consisted of a listing of the response
of each pupil to each item - not by rights, wrongs, or omits but
actually by tabulating the number of the alternative chosen by each
pupil (1, 2, 3, 4, or 5) - and having the tabulator insert the
correct response for all items after every fifth pupil entry, so that
it is possible to determine by reference to this key whether a child
has answered any item correctly and, if not, which of the
alternatives he has chosen. 1/

This type of listing is very essential 
for certain aspects of the analysis we have 
done, and in particular the analysis relat- 
ing to the categorization of the items as 
Right in the fall, Right in the spring, etc. 

In the original mode (Figure II-8), the chief advantage was that it
identifies the Wrongs by all alternatives other than the correct
response and indicates distractors that are working effectively
versus those which appear not to be attractive to anyone except on
the basis of pure chance.

It also has the great advantage of allowing the person constructing a
test to spot instances where the keyed response may not be correct or
where there may be more than one correct response - since in such
instances the number of children choosing a particular option may be
out of line with what would be expected for the difficulty value of
the item as a whole.

A different approach to reporting item analysis data that lists only
Rights, Wrongs and Omits makes the examination of the pupils'
responses much easier than the approach we have used here. In order
to illustrate this, the same page of selected cases is shown in the
Right, Wrong, Omit mode. (Figure II-9)
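
The conversion from the original mode to the Right, Wrong, Omit mode
can be sketched as follows. This is an illustration only; the
function name is hypothetical, and the coding of an omitted item as 0
is an assumption, since the report does not say how omits appear in
the original listing.

```python
def to_rwo(choices, key, omit_mark=0):
    """Convert one pupil's chosen alternatives to an R/W/O string.

    choices   -- alternative chosen for each item (1-5)
    key       -- keyed (correct) alternative for each item
    omit_mark -- assumed code for an item left blank (hypothetical)
    """
    marks = []
    for chosen, correct in zip(choices, key):
        if chosen == omit_mark:
            marks.append("O")   # item not attempted
        elif chosen == correct:
            marks.append("R")   # chosen alternative matches the key
        else:
            marks.append("W")   # some other alternative was chosen
    return "".join(marks)
```

For example, `to_rwo([1, 3, 0, 5], [1, 2, 4, 5])` marks the first and
last items Right, the second Wrong, and the third Omitted.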


The item by item comparison of the per- 
formance of each child, fall and spring, 
constitutes the ultimate approach to the 
problem of evaluating the amount of guessing 
present. 

It is evident from the rosters that as one moves from left to right
across the page (that is, from the early items to the later ones),
the number of Right responses definitely decreases and the number of
Wrong or Omitted responses increases.



1/ See pages 9 and 10, Part I. 



Choosing any particular case, it is quite evident that the proportion
of the first twenty items which are answered correctly is much
greater than the proportion of the remaining items in the test (13 in
the case of Arithmetic Applications, for example). This follows from
the publisher's arrangement of the items in order of difficulty.

We know the "average" guessing score from the number of alternatives
in relation to the number of items in the test, assuming random
marking.

In the Arithmetic Applications Test, for example, which is a
five-choice multiple choice-type test with a total of 33 items, we
would expect by chance on the average one out of each five items
marked to be the correct response, also assuming random marking
totally; i.e., none marked from sure or partial knowledge.

The average guessing score for a test of 33 five-choice items would
be one-fifth of the number of items, or 6+ (that is, 6.6).
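
The arithmetic can be sketched in a few lines. The mean follows
directly from the text; the standard deviation is an addition not
given in the report and assumes that each item is an independent
random mark (a binomial model). The function name is hypothetical.

```python
from math import sqrt


def chance_score(n_items, n_choices):
    """Mean and standard deviation of the score under purely random marking.

    Assumes every item is marked, each mark is independent, and each
    alternative is equally likely to be chosen (binomial model).
    """
    p = 1 / n_choices                     # chance of hitting the key on one item
    mean = n_items * p                    # the "average" guessing score
    sd = sqrt(n_items * p * (1 - p))      # binomial spread around that average
    return mean, sd
```

For the 33-item, five-choice test this gives a mean of 6.6 and, under
the binomial assumption, a standard deviation of a little under 2.3,
so purely guessed scores a few points above or below 6 are to be
expected.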

One can say without equivocation that anyone who has a score of only
6, where the number of items answered correctly is scattered across
the listing of items and not bunched at the beginning of the test, is
surely guessing and the test should be considered to be invalid.

At the other extreme, if the first six items were answered correctly
and very few additional responses were Wrong and most remaining items
were Omitted, this would suggest poor performance but little or no
guessing.
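
One rough way to see whether the Rights are bunched at the beginning
or scattered is to compare the proportion Right among the first
twenty items with the proportion Right among the remainder. The
sketch below is illustrative only; the function name is hypothetical
and the report does not prescribe any particular cut-off for calling
a pattern "scattered".

```python
def right_proportions(marks, split=20):
    """Proportion of R marks in the first `split` items and in the rest.

    marks -- one pupil's responses as a string of "R", "W" and "O"
    """
    early, late = marks[:split], marks[split:]

    def prop(segment):
        # An empty segment is reported as 0.0 rather than raising an error.
        return segment.count("R") / len(segment) if segment else 0.0

    return prop(early), prop(late)
```

A pupil with six Rights on the first six items and Omits thereafter
shows all of his Rights in the early segment; a pupil whose six
Rights are spread evenly through the test shows similar proportions
in both segments, which is the pattern the text associates with
guessing.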



The illustration we have chosen to use in this particular section is
taken from one page of the Arithmetic Applications pupil roster. This
is a five-choice test where one of the responses always is "Not
Given." Since "NG" is a scored response, however, it is considered
the same as the other responses.

To recapitulate the obvious, counting the number of "R's" across the
sheet to the right gives the total number of Right responses, or the
individual's score. Similarly, counting the "W's" gives the number of
responses that are incorrect, and counting the number of "O's" gives
the number of items that were Omitted.

In a comparable fashion, counting the "R's" in a particular column
for the total population gives the number of individuals answering
the item correctly.



[Figure II-8. One page of the Arithmetic Applications pupil roster in
the original mode (number of the alternative chosen by each pupil).
The roster entries are not recoverable from the scan.]



[Figure II-9. Data from Figure II-8 in the Rights, Wrongs, Omits
mode: fall (F) and spring (S) responses of each pupil to items 1-33,
with totals Right, Wrong, and Omit. The roster entries are not
recoverable from the scan.]



In doing this, of course, one would have to pay attention to the fact
that the data are tabulated for both fall and spring and derive a sum
of "R's" for a particular item separately for the two times the test
was administered.

Dividing the number of "R's" by the number of cases would give the
percent correct, or the item difficulty, which has been tabulated
elsewhere and commented on at some length. 1/
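
The row and column counts just described can be sketched as follows,
assuming the roster for one administration (fall or spring) is held
as a list of equal-length R/W/O strings, one per pupil. The
representation and the function names are hypothetical; fall and
spring rosters would be passed separately.

```python
def pupil_totals(marks):
    """Row count: number of Rights, Wrongs and Omits for one pupil."""
    return marks.count("R"), marks.count("W"), marks.count("O")


def item_difficulty(roster):
    """Column count: proportion of pupils answering each item correctly.

    roster -- list of R/W/O strings for ONE administration, all the
              same length (one character per item)
    """
    n_pupils = len(roster)
    n_items = len(roster[0])
    return [
        sum(1 for row in roster if row[i] == "R") / n_pupils
        for i in range(n_items)
    ]
```

Multiplying each proportion by 100 gives the percent correct for the
item.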

By and large the Rights, Wrongs, Omits mode is the preferred mode of
distributing the pupil responses for class use and certainly is much
easier to work with in evaluating the protocols for a particular
student.

Perhaps it would be helpful to conclude 
this discussion of the rosters of pupil re- 
sponses by indicating in a summary fashion 
just what one would do with these data. 

1. The consistency of the response from fall to spring would indeed
be one of the first things to look for. (For a better evaluation of
the advantages of doing this, it is suggested that the reader review
the section on categorical analysis earlier in Part II.)

For instance, one can tell from the responses for pupil #001 that 14
of the 33 items were answered correctly both in the fall and spring,
and therefore, assuming that none of these came about by fortunate
guessing, a very large proportion of this test was nonfunctioning for
this child.

However, it is most enlightening to note that there are no "RR"
responses beyond item #22, so at about this point the test apparently
becomes a viable test for this child if measurement of change is a
major goal.

There are eight instances where the choice in the fall is Wrong and
the spring response is correct; i.e., the Wrong/Right combination.
These are the eight items that suggest actual learning may have taken
place.
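
Counting the fall/spring combinations for one pupil can be sketched
as below. The two-letter codes ("RR", "WR", and so on) follow the
text; the function name and the string representation of the
responses are hypothetical.

```python
from collections import Counter


def fall_spring_categories(fall, spring):
    """Count each fall/spring response combination for one pupil.

    fall, spring -- R/W/O strings of equal length, item by item.
    Returns a Counter keyed by two-letter codes, fall mark first,
    e.g. "RR" (Right both times) or "WR" (Wrong in fall, Right in spring).
    """
    return Counter(f + s for f, s in zip(fall, spring))
```

The count under "RR" corresponds to the items described above as
nonfunctioning for the child, and the count under "WR" to the items
that suggest actual learning may have taken place.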

2. The number Right in the spring should substantially exceed the
number Right in the fall.

1/ See pages 8 to 14, Part II.



For example, case #008 had 20 Right in the fall but only 19 Right in
the spring. Such a situation could come about if the test was highly
specific to the curriculum of the grade below and the child had not
at all been exposed to the content of the current curriculum at the
time he took the test in the fall.

This would heighten his opportunity to improve his score as the
result of seven months of instruction; but apparently the original
score (or the final score) was not valid, since there is an actual
loss!

Case #002 would appear to be a case falling in this category, but an
examination of the individual responses item by item makes one
wonder. In the last 13 items in the test there are only 3 Right in
the spring, and all of the remaining responses (that is, 10) are
Wrong; so it is patently evident that guessing has occurred in this
particular instance.

3. There are other ways of studying 
these data, limited only by a person's imag- 
ination and actual knowledge of the case. 

A teacher examining data of this sort, knowing the child and knowing
his day by day performance, can find this kind of exercise enormously
illuminating. This can result in a decision to consider a test
invalid, so far as the total score is concerned (if a pattern like
case #002 is found), largely due to the very erratic types of
responses to be found in the spring compared to fall.

4. To generalize broadly, in a non-guessing situation the Right
responses will constitute a very large majority of the items
attempted, with few Wrongs and Omits in indirect proportion to
attempts.

Where guessing is rampant, a pupil's total score will approach the
average random score - with considerable chance variation in both
directions.

Where some knowledge is present and some item response is marked most
of the time (i.e., very few Omits or none at all), one must proceed
cautiously. Data from the predicted score analysis and from the R/A
analysis will help, but there is NO infallible way of identifying
which "R" responses are guesses and which are the result of learning.
SAMPLE CASES FOR ILLUSTRATIVE PURPOSES

We have selected some sample cases drawn from the actual roster of
pupils for the item analysis made in connection with this study. On
the Item Analysis Data sheet the recording of the choice made by the
pupil (i.e., 1-2-3-4-5 or a-b-c-d-e, whichever it might be) has been
changed to the Right-Wrong-Omit mode of recording item data without
regard to the alternative chosen.

Thus it is possible, without the use of a key, to count the number
Right for any segment of the child's item analysis response pattern.
Rights divided by Rights plus Wrongs gives the guessing ratio (as it
was originally called), which of course is the ratio of Rights over
Attempts (R/A).
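
The R/A ratio as defined above can be sketched directly from an R/W/O
string. The function name is hypothetical, and returning None when no
item was attempted is an assumption of this sketch, since the report
does not say how such a case was handled.

```python
def ra_ratio(marks):
    """Ratio of Rights to Attempts (Rights plus Wrongs) for one pupil.

    marks -- R/W/O string for any segment of the pupil's responses.
    Omits are not counted as attempts. Returns None if nothing was
    attempted (an assumed convention, to avoid dividing by zero).
    """
    rights = marks.count("R")
    wrongs = marks.count("W")
    attempts = rights + wrongs
    if attempts == 0:
        return None
    return rights / attempts
```

A pupil who omits freely and is usually right on what he attempts
will show a high ratio; a pupil who marks every item at random on a
five-choice test would be expected to show a ratio near one-fifth.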

We have chosen three samples from the Random Sample population. In
addition to the specific item data noted above, all other available
information concerning each of these three cases has been collected
and considered so as to give as complete a picture of each pupil as
possible. These data have been recorded on the Individual Profile
Chart and Personal Data Sheet for each case.

The answer sheets for these children for both spring and fall are
available and they have been examined for any departure from an
acceptable method of marking.

We have taken a quick look at each child's performance on all of the
tests he took as it is "on view" on his answer sheet. We found
nothing that looked atypical; that is, nothing that would say "Stop"
to the computer under regular scoring routines for any of the cases.

Each child's school learning potential, or IQ as derived from the
Otis-Lennon Mental Ability Test: Elementary II: Form J, has been
checked and is considered along with other data. The sex of each
child in this analysis has been noted, although this appears to be of
little significance, according to our analyses of the data as a
whole.

We know the child's birth date and his testing date (and, therefore,
his chronological age), and with this information plus the score on
the Otis-Lennon Test we have computed the Deviation IQ by looking it
up in the appropriate tables provided by the publisher to check the
information already written down for each pupil.

Unfortunately, we do not have the advantage of seven months of almost
daily observation of each child; i.e., every day of the week except
Saturdays and Sundays for 140-150 days. This is the big advantage of
the classroom teacher's observation, and it would be foolhardy indeed
for anyone interpreting test scores ever to ignore it. Even the
observation for the short period of time prior to the administration
of the tests in mid-October is very valuable, especially if an
appropriate system of cumulative records is in force.

Information concerning the children 
moving up from the lower grades should al- 
ways be passed along to the next teacher at 
the beginning of the school year - objective 
information, especially, as well as observa- 
tional evaluations . 

At the upper grades this more often is done formally through a
cumulative record card, but such information is infrequently passed
on from level to level; i.e., elementary to junior high to senior
high, etc. Computer technology has done much to change this omission
in places where it is available, like Dade or Pinellas Counties in
Florida and hundreds of other large city and county units.

It will be recalled that part of our analysis has been done by
stratifying the data, not only in the usual ways but also by
separating the sample into a group of those who do not attempt all of
the items (i.e., follow the supposedly classic pattern - see Sample
A) on the one hand and, on the other hand, into those who attempt all
of the items, regardless of how many they answer correctly (see
Sample B); i.e., we show the two extremes of a continuum and not a
dichotomy.

A peripheral value of having the item analysis presented in the
Rights-Wrongs-Omits mode is that it not only allows the teacher to
get the general pattern of a single pupil's responses item-by-item
for comparison with the items as presented in the test booklet, but
by summing the columns for the class rosters presented in this
fashion one also obtains the number of Rights, Wrongs, and Omits for
the class as a whole.

This information both for the individual and for the class,
especially with respect to those items that have been answered
incorrectly or omitted (or have been answered inconsistently from
fall to spring, in this particular study), provides information that
is surely as worthwhile as the total score on the tests interpreted
in terms of any norm, whether local or national.

We show this type of analysis for all three sample cases, but it
certainly is too laborious for the teacher to do for all children. It
is entirely feasible if computer assistance is available. We have
previously shown a sample page for one test in class roster form (see
Figure II-9).
In any contained system (i.e., a response system in which a child
chooses his answer from a selected number of alternatives), there is
the possibility of a chance response, so a Right response either in
the fall or in the spring, or even in both fall and spring, is not
incontrovertible evidence that the child knows the knowledge or skill
measured by the particular item.



It is perfectly practical and desirable to assume that a Right
response in the fall is more likely to indicate knowledge (i.e.,
evidence of "knowing" rather than a chance response, especially in
the early, easier items), but a consistent Right response in the
spring leads more convincingly to the conclusion that the child does,
indeed, know the skill or information measured by a particular item.
It is too late to wait for that if the tests are to be used during
the "between-testing" period for improving instruction.

If his reply is inconsistent (i.e., fall "R", spring "W") his first
"R" response was probably a guess.



However, the only way that a teacher 
can ever know whether a particular knowledge 
or skill is really mastered is to observe 
the application of the child's knowledge or 
skill in a whole -series of everyday situa- 
tions where that knowledge or skill is es- 
sential for success on a particular task. 

To know whether the child truly knows the number combination of 8+9
"for sure," he must be put to the test in a number of varied
situations where a basic segment of the total task requires that he
know and apply the knowledge that 8+9=17.

This kind of information is beyond our knowing on the basis of the
test data available, even in a fall-spring testing program.



Here again, the importance of the teacher's constant reevaluation of
the knowledge and skills of her pupils in a repetitive and
"maintenance of skills" fashion is perfectly evident. This kind of
approach, when done by an appropriate test, would be what is more
widely known now as "criterion reference testing"; i.e., no norm is
supposedly required, although this is more a "seemingly so" situation
than an actual one.

Let it be taken for granted, in this particular instance, that the
primary purpose of a testing program such as the one presently
considered is to improve instruction. The testing program does this
by providing objective evidence that material presented either has
been known at one time or not known (R versus W).

If not known, it has to be learned during the course of instruction
between first testing and second testing if the response is "WR" or
"OR" - the only categories that really measure.

It assumes that the teacher will take the evidence of the first test,
when summed overall for the class as a whole, to indicate areas of
weakness which need to be strengthened. When considering the pattern
of responses for all items on a particular test given in the fall for
a particular child, the teacher will try to assess his status,
identify areas of weakness, and modify and strengthen his instruction
at certain points so as to build on what the child does know and to
provide the support and help he needs to learn what he should know in
accordance with the local curriculum.

With this background, let us consider now the cases that have been
selected in the manner indicated above, case by case. Each case's
data are presented on a separate page which contains all of the
available information about a particular child, but these pages are
run into the text in such a way that a discussion of a particular
child immediately follows the presentation of all the available data
concerning that child.



SAMPLE A

INDIVIDUAL PROFILE CHART AND PERSONAL DATA SHEET

New Hampshire Statewide Testing Program
Otis-Lennon Mental Ability Test: Elementary II: Form J - Fall 1969
Stanford Achievement Test: Intermediate I: Form X - Fall 1969 and Spring 1970

GRADE 4

Case #:   RANDOM SAMPLE ✓   TITLE I
School:   Public   Parochial ✓     City or Town: NASHUA
Boy ✓   Girl

Date of Fall Testing: 10/13/69   Date of Birth: (not legible)   Age: 9 years 6 months
Median Grade 4 Age, Fall 1969: 9 years 6 months - Random Sample ✓   Title I
Norms: OLMAT - National - DIQ 107   Percentile Rank: Age 67, Grade 71   Stanine: Age 6, Grade 6
       SAT - State - Random Sample ✓   Title I
SAMPLE A: Interpretation

Our first case is a boy, deliberately chosen from the random sample
for reasons that will appear shortly. His age was 9 years and 6
months as of the date of testing in the fall, exactly the median age
of the group.

He had an Otis-Lennon DIQ at that time of 107. His Otis-Lennon
percentile rank according to age was 67, which corresponds to a
stanine of 6. His grade placement percentile rank was 71 on the same
test, and it also corresponds to a stanine of 6.

Thus we have a youngster who is exactly at age for grade but who is a
little brighter than the average in terms of measured mental ability;
i.e., a little better performance should be expected of him, all
things considered.

His school performance, as shown on the Individual Profile Chart for
the five tests in which we are presently interested, indicates that
he earns stanines which generally are in the 6 or 7 range in the fall
with one 8, namely in Arithmetic Applications.

He also has a 4 stanine in the fall in Paragraph Meaning. He is
probably significantly below the reading grade level of the fall
random sample of children tested in 1969.

This is a rather unusual situation in light of his Otis-Lennon
stanine of 6. There is reason to suspect that perhaps it may be a
true reflection of his situation in view of a very substantial gain
in Paragraph Meaning during the seven months between tests, hopefully
due to some successful remedial instruction.

The Personal Data Sheet also gives fall and
spring raw scores (Rights), the number of
Wrongs, the number of Omits, the number of
items in each test, and the R/A ratio.

This case is most notable for the number of
Omits, reflected in the relatively very high
R/A ratios. This use of the omit technique,
rather than random guessing, is convincing
preliminary evidence that the test data are
valid.

The earned stanines are shown on the plotted
profile. The stanines are based upon the
distributions of scores for the random sample
and were computed separately for fall and
spring. The stanines used in the fall and
based on the fall random sample are shown as a
solid line on the profile; the spring stanines
are shown as a broken line. Thus growth or
change is reflected in deviations from the
stanine average (5) from subject to subject.

For example, a consistent upward trend in such
an instance indicates more than average growth
relative to the median of the conversion
sample; a downward trend, the opposite.

One other item of information which looks
interesting (and to some extent suggests a
problem area) is in the Arithmetic
Applications data, where he makes a score of
20 in the fall but has a gain of only 2
points, to 22, in the spring.

However, the score of 20 gives him a stanine
of 8 in the fall random sample. He probably
already had been exposed to and had learned,
to a very substantial degree, most of the
material that was presented through the fourth
grade. (In or out of school? A transfer
student? Naturally gifted in the number area?)

His failure to make a gain in score of
appreciable amount in Applications, though he
gained substantially in Concepts, might very
well be due to the fact that he did not get
very much additional exposure to problem-type
material consistent with his ability to
perform as indicated by the stanine of 8 in
the fall.

This high math score and corresponding stanine
versus low reading score and corresponding
stanine in the fall also is a common
indication of a reading difficulty.

He has quite apparently used the omit
technique generously as a "don't know"
indicator in every test both fall and spring
with the exception of the spring Arithmetic
Concepts Test, where he omitted no items, but
still came up with an R/A ratio of .84.

This is not too surprising in view of the fact
that there are only 32 items in this test
anyway, and his original score was 17 Right
and 6 Wrong. In the spring, he had only 5
Wrong and 27 Right, for a gain of an
extraordinary 10 points.
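
The Arithmetic Concepts figures just quoted can be checked with a short sketch. It takes the R/A ratio to be Rights divided by items attempted (Rights plus Wrongs), following the definition given elsewhere in this report; the function names are illustrative only.

```python
def ra_ratio(rights, wrongs):
    """Rights / Attempts, where attempts = rights + wrongs (omits excluded)."""
    attempts = rights + wrongs
    if attempts == 0:
        raise ValueError("no items were attempted")
    return rights / attempts


def omit_count(n_items, rights, wrongs):
    """Items left unanswered on a test of n_items questions."""
    return n_items - rights - wrongs


N_ITEMS = 32  # Arithmetic Concepts, as stated in the text

# Fall: 17 Right, 6 Wrong. Spring: 27 Right, 5 Wrong.
print(omit_count(N_ITEMS, 17, 6), round(ra_ratio(17, 6), 2))
print(omit_count(N_ITEMS, 27, 5), round(ra_ratio(27, 5), 2))
```

The spring line gives no omits and a ratio of .84, matching the value reported above; the fall line shows the nine items he chose to omit.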



Comparing his predicted and actual scores, we
see that they are not only high but generally
fairly close (again with the exception of
Paragraph Meaning), all of which supports the
conclusion that his reading was a problem area
in the fall.

This possibility of a correctable reading
deficiency is great in view of the fact that
he makes an enormous raw score gain in
Paragraph Meaning, from a raw score of 19 to a
raw score of 36 in the spring (stanine gain,
relative to separate sets of fall and spring
stanines, of 2) while he improved the R/A
ratio from .66 in the fall to .88 in the
spring. It is the latter piece of information
that is most convincing.

On the whole, this child is not one who is
going to give anybody any trouble in school in
terms of his subject-matter-oriented
performance.

His profile in the spring is remarkably
uniform in comparison with his measured mental
ability, with the noted exception of Concepts
where he has exceeded the expected stanine
substantially. Everything else is within
chance limits of his Otis-Lennon grade stanine
of 6.

The effect of regression, it must be noted, in
above average ability is to maintain the
status quo or regress toward the mean; he did
not regress.

The matter of making a test profile for any
child is one of great concern. It shows a
great deal graphically if the profile is in
comparable units. Hence the scores have been
profiled in separate stanines for fall and
spring, which are comparable because the group
is comparable; i.e., identical, in fact. This
point is rather subtle but of great
significance.

Saying this makes it necessary for us to try
to clarify the idea back of this method of
indicating change or inconsistency in growth
pattern.

This writer has long advocated profiles in
such comparable stanines as a way of
reflecting growth rather than measuring it
directly, by the magnitude of a change in some
kind of standard score.

A direct measure of growth has long been
sought as a highly desirable statistic, but
this has proved to be almost impossible to
achieve in any kind of continuous standard
scores because the continuous growth curve
(i.e., the line of relation drawn through
medians or means) varies in slope from subject
to subject.

Any set of scaled scores that attempts to do
what Thurstone's scaled scores are supposed to
do, namely create a kind of artificial
absolute zero and scale the scores along a
continuum from the very beginning grades to
the highest possible grade, is doomed to
failure as a measure of comparable growth
unless the growth curve is the same in all
areas. Furthermore, the growth potential of a
child is just not going to be the same from
subject to subject or from one grade to
another, for reasons too numerous to mention.

In a subject such as Word Meaning, growth is
very subject to influences from the total
outside environment as well as from in-school
instruction; while another subject (such as
Arithmetic Applications or Problem Solving) is
very largely a school-oriented skill with,
generally, little or no outside incidental
learning.

Fundamentally, the idea here is that a child's
"growth" (i.e., tested development) is
reflected by the extent to which he deviates
from the average of his peers and/or from his
own average from year to year, allowing for
random or chance errors (standard error of
measurement).

In addition to the tests which we are
including in our profile, we have added one
more statistic; namely, a Composite Prognostic
Score based upon weighted stanines.

Such a composite is by far more stable than
any single stanine score. Because it is made
up of a weighted average of the stanines for
all of the achievement tests plus the measured
mental ability test, the item base is much
greater. This makes for a more reliable
individual pupil reference point.

Weights can be assigned to the tested elements
either by statistical methods or by judgment.
In this case, the weights used were judged
weights very similar to those used for years
in the writer's New Hampshire statewide 8th
grade programs and other similar programs.

Case #452 has stanine composites of 6 for both
fall and spring, using weights as listed
below: 1/

OLMAT Raw Score                     3
SAT Paragraph Meaning               2
 "  Arithmetic Computation          2
 "  Word Meaning                    1
 "  Arithmetic Concepts             1
 "  Arithmetic Applications         1
                                   --
                                   10
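
A weighted average of stanines using the weights listed above might be computed as in the sketch below. The weights are those of the table; the individual stanine values are hypothetical, chosen only to be consistent with the ranges mentioned in the text, and the function name is illustrative. The re-scaling step mentioned in the footnote for general practice is not attempted here.

```python
# Judged weights from the table above (they sum to 10).
WEIGHTS = {
    "OLMAT": 3,
    "Paragraph Meaning": 2,
    "Arithmetic Computation": 2,
    "Word Meaning": 1,
    "Arithmetic Concepts": 1,
    "Arithmetic Applications": 1,
}


def composite_stanine(stanines, weights=WEIGHTS):
    """Weighted average of stanines; raises KeyError if a test is missing."""
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[test] * stanines[test] for test in weights)
    return weighted_sum / total_weight


# Hypothetical fall stanines for illustration only; the full set of
# stanines for Case #452 is not legible in the source.
example = {
    "OLMAT": 6,
    "Paragraph Meaning": 4,
    "Arithmetic Computation": 6,
    "Word Meaning": 7,
    "Arithmetic Concepts": 6,
    "Arithmetic Applications": 8,
}

average = composite_stanine(example)
print(average, round(average))
```

Because the mental ability test carries three of the ten weight points, a single unusually high or low achievement stanine moves the composite only slightly, which is the stability the text claims for it.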



1/ The Composite Prognostic stanines for this
   individual were obtained by averaging his
   separate stanines; in general practice the
   sums of the weighted stanines are re-scaled
   to avoid the shrinking effect of an
   averaging procedure.

Constant Failure as a
Personality Determinant

Another child, perhaps a slow-learning child
with stanines running in the 2-3-4 range, all
too often will experience the constant stigma
of failure because he will be at the low end
of the stanine scale (for shame!) and, as a
result, will rapidly develop a negative
attitude toward school and toward his own
learning potential.

Test results consistent with potential can
never be considered to be evidence of failure.
Therefore by tracking a child from year to
year, using his own weighted stanine position
based on both mental ability plus measures of
achievement to obtain a Composite Prognostic
Score, we are able to see to what extent he
varies from year to year on an empirical basis
and thus grasp more firmly the type of
individual we are dealing with.

Of all the scores reported, the Composite
Prognostic Score gives the most practical
single estimate of what could be expected from
this pupil barring some traumatic changes in
some aspect of his situation. The results in
this case bear out this contention.

Let us abandon all talk of success or failure
where test scores are involved and we will be
well on our way toward obtaining acceptance
from the child of what he is as regards verbal
learning and without stigma, because it is no
one's opinion but a reflection of facts!

This assumes the development and acceptance of
the practices and attitudes advocated in this
report, including especially the reality of
great individual differences.

SAMPLE B

INDIVIDUAL PROFILE CHART AND PERSONAL DATA SHEET 

New Hampshire Statewide Testing Program 
Otis-Lennon Mental Ability Test: Elementary II: Form J - Fall 1969
Stanford Achievement Test: Intermediate I: Form X - Fall 1969 and Spring 1970

GRADE 4

Case [number illegible]        RANDOM SAMPLE ✓        TITLE I

School: Public ✓    Parochial        City or Town: DOVER

Boy ✓    Girl

Date of Fall Testing: [illegible]    Date of Birth: [illegible]    Age: 8 years 11 months

Median Grade 4 Age, Fall 1969: 9 years 6 months - Random Sample ✓    Title I

Norms: OLMAT - National - DIQ 100    Percentile Rank: Age 50, Grade 36    Stanine: Age 5, Grade [illegible]
SAT - State - Random Sample ✓    Title I



[Profile chart: GRADE 4 STANINES, Fall and Spring, plotted on a 1-9
stanine scale for Word Meaning, Paragraph Meaning, Arithmetic
Computation, Arithmetic Concepts, Arithmetic Applications, Otis-Lennon,
and Composite Prognostic Score. The plotted values are not recoverable
from the scan.]

[Personal Data Sheet rows: Stanines, Rights, Wrongs, Omits, No. of
Items, R/A, and Pred. Score, each for fall (F) and spring (S). Legible
entries: Omits are 0 for every test, fall and spring; No. of Items:
Paragraph Meaning 60, Arithmetic Computation 39, Arithmetic Concepts 32,
Arithmetic Applications 33, Otis-Lennon 80. The remaining entries are
illegible.]

SAMPLE B

Our second sample case is summarized on the
Individual Profile Chart and Personal Data
Sheet preceding this interpretation.

This case was drawn from the random sample and
is a boy going to public school who, at the
time of fall testing, was 8 years and 11
months old compared to the median age for the
grade of 9 years and 6 months. His Otis
Deviation IQ was 100. His percentile rank on
the Otis was .50 on an age basis and on a
grade basis only .36.

Note first the discrepancy in the
chronological age. This child is about 7
months younger than the average age in the
grade, but his intelligence level as recorded
is about typical for the community in
question.

A difference of minus seven months at the
fourth grade level in terms of chronological
age, and in this case quite unquestionably an
equal or greater difference in mental age, can
make a substantial difference in achievement.

This child must have been admitted to school
about as early as the law would allow. Without
question, he would be better off if he were in
grade 3 rather than in grade 4; all subsequent
data support this conclusion.

At this point, with all the evidence in hand,
we must ask why he was ever allowed to get
into 4th grade! A true ungraded primary system
would have certainly found it highly desirable
to give him at least four years to complete
grade 3, and possibly even five years!

Turning now to the data at the bottom of the
profile sheet, we see first one very notable
element; namely, that he has omitted no items
on any test including the Otis-Lennon.

In other words, he has immediately indicted
himself as a guessing person since, with this
chronological age and this level of mental
ability, he could not possibly be working
effectively in the latter part of any Stanford
Test, all of which have been shown to be
difficult for the average child in the state
as a whole, and certainly much too difficult
for the younger children in grade 4.

Looking first at his earned score on the
Otis-Lennon Test, we see that he received a
score of 26 Right out of the 80 questions,
which is only 10 points of score above the
chance level. This immediately raises the
question as to how he could get an IQ of 100
with such a relatively low score.

The answer must lie in the fact that he was
taking a level of the Otis-Lennon that was too
hard for him by virtue of the fact that he was
in the fourth grade in spite of his being
underage for the grade, and the Otis-Lennon
level used was one which was recommended for
use at the fourth grade level, but this was
the lowest grade at which it should be used.

Furthermore, the directions do not make any
allowance whatsoever for the influence of
guessing. A glance at the pattern of his
responses on the Otis-Lennon, as shown on the
Item Analysis Data sheet, indicates that he
answered a few items at the beginning of the
test correctly and then began a Right-Wrong
type of response which degenerates at about
#17 into a pattern which could be accounted
for almost wholly by random marking without
reference to the test booklet at all.

It is possible that he really only answered
about twelve questions on the basis of
knowledge, and the remainder of the Rights are
largely due to chance. A score of 12 in
conjunction with his age would yield an IQ of
only 77, and not 100. This is probably an
underestimate of his mental ability level, but
it certainly is strong evidence that the 100
is too high.

For example, he does get an occasional item
correct well along toward the end of the test,
the most outstanding example being item #75.
However, this is preceded by a string of five
Wrong responses and the five subsequent items
are all answered incorrectly as well.

The Otis-Lennon is an 80-item test with five
alternatives and no specific warning against
guessing. The average chance score, therefore,
on the test is 1/5 of the total number of
items, or 16, and his earned score of 26 falls
only 10 points above this average chance
level.
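
The chance-level arithmetic above can be written out as a brief sketch. The first function reproduces the 1/5 rule stated in the text; the second is an added illustration that assumes every item not answered from knowledge is marked purely at random, which is a simplifying assumption and not a procedure described in this report. Function names are hypothetical.

```python
def chance_score(n_items, n_alternatives):
    """Average score expected from marking every item at random."""
    return n_items / n_alternatives


def expected_rights(known, n_items, n_alternatives):
    """Expected Rights if `known` items are answered from knowledge and
    all remaining items are marked at random (simplifying assumption)."""
    if not 0 <= known <= n_items:
        raise ValueError("known must lie between 0 and n_items")
    return known + (n_items - known) / n_alternatives


N_ITEMS, N_ALTERNATIVES, EARNED = 80, 5, 26

baseline = chance_score(N_ITEMS, N_ALTERNATIVES)
print(baseline, EARNED - baseline)

# About twelve items answered from knowledge, as suggested in the text.
print(expected_rights(12, N_ITEMS, N_ALTERNATIVES))
```

Under that assumption, twelve known items plus random marking of the other 68 would be expected to yield roughly 25 to 26 Rights, which is in line with the earned score of 26.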

It would be entirely within the realm of
possibility for him to have gotten a score of
as high as 26 without ever looking at the test
booklet whatsoever, but simply marking the
answer sheet; but the fact that he did answer
a sequence of items at the beginning of the
test correctly is convincing evidence that
this certainly was not the case.

His percentile ranks on the Otis, both on the
basis of the age group to which he belongs and
the grade group, are based upon the score of
26 on the assumption that this is a valid
score. Even so, he achieves a percentile rank
of only .50 when compared with other children
of the same age, who would not typically be in
fourth grade, and certainly not if their
performance during the first three grades was
what the data we have at hand suggest his was.

His percentile rank according to grade is only
.36, meaning that his score of 26 is reached
or exceeded by 64% of children in the fourth
grade in the national standardization
population on which the Otis norms are based.

A Consideration of Rights, Wrongs, and Omits

Looking at the data provided at the bottom of
the profile page, we see first of all that the
Right scores are low, not only on the Otis but
for all of the tests in the Stanford Battery.
In fact, the only instance where the number of
Rights exceeds the number of items answered
incorrectly is in the spring of the year, when
he answered 36 items Right in Paragraph
Meaning and answered incorrectly 24.

His gains from fall to spring are reasonably
good. In fact, the gain from 19 to 36 in
Paragraph Meaning, if it could be taken
literally, would be an astonishingly high
gain, and his gain in Word Meaning from 6 to
17 is hardly less surprising.

Rights / Attempts

As indicated in the text, the R/A ratio shows
the proportion of all items attempted which
were answered correctly. As one would expect
from the data previously presented, he tends
to be under the median values for fourth grade
children, and mostly by substantial amounts.

Actually, he does not exceed the median in any
instance either fall or spring, but his ratios
tend to be better in the math field in the
spring than they were in the fall. This is
also a group tendency, strengthened by the
fact that these pupils had been studying
related material for a period of seven months,
and therefore by some amount had reduced the
opportunities to guess by their actual
increment in knowledge.

In the absence of any other information, one
would conclude that this child had made rather
substantial progress in both vocabulary and
reading during the seven months between
testing and that his gains in the Arithmetic
area, although small, have to be interpreted
in view of the fact that the gains for the
state as a whole also were small.

At this point, we are led to raise the
hypothetical question: Where did this child
pick up the "Attempt-All" pattern of response?
Was it early, in the attempt to live up to a
role in which he was quite unwittingly cast by
being admitted at such an early age?

One must further wonder to what extent his
performance in class was comparable to his
performance on the tests that he took in the
Stanford plus the Otis.

In other words, according to the teacher's
observation did he appear to read fairly well?
Was his seatwork in arithmetic reasonably
good? Or, on the other hand, did the teacher
perceive him as being essentially a
slow-learning child? Was there any recognition
of the fact of his being underage as well as
probably below average in mental ability, if
one allows for guessing?

Summary 

After a careful examination of the test
information, taking into account the
proclivity of this child to mark all responses
regardless of knowledge and his generally poor
R/A ratios, we have to conclude that his tests
were substantially invalid as measures of his
true status both in the fall and in the
spring, although the tests do suggest some
rather amazing improvement in the language
area during the course of the year.

The true nature of this child's performance is
really seen best in the summary of his item by
item responses, as we see the pattern of
chance responses emerging clearly after a very
few of the easiest items have been answered.

The Predicted Score

As for all cases analyzed, we used the formula
Predicted Score = A + (A'/C x d) to predict
this pupil's scores on the Stanford
Achievement Test. 1/ These appear in the last
line of the Personal Data Sheet.



For every test except Word Meaning in the
spring, which in itself is a curious
situation, the proportion of all items marked
Right seemed adequate to make this prediction
reliable. However, from the previously
established fact that guessing is a "way of
life" for this child, we know that even here
his Rights scores for the first 20 (23) items
probably are inflated in most tests. (See Item
Analysis Data following the Individual Profile
Chart and Personal Data Sheet.)

1/ See pages 11-34 and 35 for further
   explanation of this procedure.

In Word Meaning, only four responses in the
first twenty are correct, and one of these is
followed by a Wrong response in the spring.

In Paragraph Meaning, there are ten Rights in
the first 23 in the fall, but two of these
have a Wrong response in the spring.

To take one more instance, in Arithmetic
Concepts nine out of the first twenty fall
responses are Right, but of these four are
followed by Wrongs in the spring.

In spite of all this, the agreement of Total
Rights (by machine) and our predictions is not
off badly. Of the ten predictions (fall and
spring), [text illegible in the source] his
fall prediction is fractionally higher. This
is the type of pattern expected of a guessing
child, i.e., prediction lower than machine
scores. The first twenty items are the easy
items, where guessing is less necessary
because it is actually easier (or more
satisfying psychologically) to answer out of
knowledge than to guess; while later items of
increasing difficulty are impossible to
answer, except by guessing, in almost every
instance.



One must conclude on every basis that this
child's performance on this test should be
completely disregarded as a valid measure
[words illegible in the source], generally
[illegible] by substantial margins what he is
truly capable of doing, and making it very
desirable to throw out the results totally.

Perhaps the most significant thing that can be
said about this child is the fact that if one
were to deal solely with total scores or with
the stanine profile, entirely erroneous
conclusions could be drawn.

It is only when one notes that there are no
omits, and then actually looks at the item
responses, that the conviction that his test
result is invalid grows so strong as to make
it necessary to declare the case totally
erroneous and actually a detriment to the
child to be retained in his record.

SAMPLE C

This third and last case we will discuss is
very interesting because it is so different
from the two previous cases or from a typical
profile pattern.

It is a girl in a small city. As of the date
of testing this girl was 9 years and 1 month
of age, making her five months younger than
the average of her grade at the time of
testing. In this respect, she is similar to
the previous case. However, she differs
radically in that her Deviation IQ on the
Otis-Lennon Mental Ability Test is 128 and her
Otis-Lennon percentile rank by age is .96 and
by grade is .93.

Turning now to the data tabulated at the
bottom of the Individual Profile Chart and
Personal Data Sheet and considering first her
fall performance, we see that she has made use
of the Omits option in every one of the five
achievement tests as well as the Otis-Lennon,
where she had a Right score of 57, a Wrong
score of 15 and Omitted 8 items.

Her R/A ratio is high for Word Meaning and
Paragraph Meaning, as it is for the Otis. It
also is high for Arithmetic Concepts and only
slightly lower for Arithmetic Applications,
but in both instances her R/A ratio is higher
in the fall than it is in the spring. This is
a reversal of the situation found for the
group as a whole.

Her R/A ratios for Arithmetic Computation are
low, quite unsatisfactory as a matter of fact,
with values only of .32 and .34 for fall and
spring respectively. The reasons for this
become perfectly evident when you study the
data on the Item Analysis Data sheet.

This girl's gains from fall to spring are
notable in Word Meaning and Paragraph Meaning,
but it is evident that something has gone very
badly awry in the arithmetic field.

Considering first Arithmetic Computation, she
makes a gain from fall to spring of only 1
point, going from 9 to 10, and both of these
scores are very near the average guessing
level.

In Arithmetic Concepts her scores are higher,
but her gain in score is only from 19 to 21,
or 2 points of score compared to about 4
points average gain for the random sample.

Her scores are more reasonable in Concepts
than they are in the other fields of
arithmetic, but the bloom is taken off the
blossom to some extent by noting that she
begins a guessing pattern almost from the very
beginning. She has two "W" responses in the
first ten easy items; she has two "RW"
responses in the first fifteen items, another
for item #22; she attempts all items in the
spring but misses six out of the last fifteen
items.

The items she answered "RW" are particularly
serious, as such responses are almost a
precise indictment of the response as being a
guessed response.
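
Counting "RW" responses, that is, items marked Right in the fall and Wrong in the spring, is a simple item-by-item comparison. The sketch below assumes each administration is coded as a string of R (Right), W (Wrong) and O (Omit), one character per item; that coding, the function name, and the sample strings are all hypothetical and are not taken from the Item Analysis Data sheet.

```python
def rw_items(fall, spring):
    """Return the 1-based numbers of items Right in fall and Wrong in spring.

    `fall` and `spring` are equal-length strings over the letters
    R, W and O (a hypothetical coding of the item analysis data).
    """
    if len(fall) != len(spring):
        raise ValueError("both administrations must cover the same items")
    return [
        number
        for number, (f, s) in enumerate(zip(fall, spring), start=1)
        if f == "R" and s == "W"
    ]


# Made-up responses for a ten-item block, for illustration only.
fall_responses = "RRWRORRWRR"
spring_responses = "RWWRRRWRRR"

reversed_items = rw_items(fall_responses, spring_responses)
print(reversed_items, len(reversed_items))
```

A pupil who truly knew an item in the fall would rarely miss it seven months later, so a long list returned here is the kind of evidence of guessing the text describes.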

Arithmetic Applications in many ways is the
most peculiar of all the tests. Only in the
first block of five items does she demonstrate
knowledge that you can depend on. In the
second block, three of her fall responses
which were correct become incorrect in the
spring ("RW"); in the third block, three
responses were Right in the fall and Wrong in
the spring; before the end of the test, she
has reversed two more responses that were
originally Right to Wrong in the spring,
making a total of eight "RW" responses.

Her last three groups are all suspect re
guessing. Finally, her score of 17 in the fall
drops to a score of 15 in the spring, implying
no gain at all during the seven months period;
in fact, a loss of 2 points as compared to a
gain for the total group of 4 points.

Her Predicted Scores coincide fairly well with
her earned scores for all tests both fall and
spring.

The stanine profile of her performance,
remembering again that these are separately
computed stanines for fall and spring, would
suggest that she has moved along pretty much
in synchronization from fall testing to
spring, there being no greater difference
between stanines than 1 point. A drop of 1
point is to be found in each of the three
Arithmetic Tests, while a gain of 1 point is
to be found in Paragraph Meaning; in Word
Meaning, her stanine is identical.

With her Otis-Lennon score and DIQ, she should
have greatly exceeded state average
performance. Hence, she definitely has an
arithmetic problem. Lack of mastery of
fundamentals is the best guess because this is
a familiar pattern for very bright children
who often are lax in rote learning.

It has been repeatedly stated throughout this
report that Word Meaning and Paragraph Meaning
are subjects in which status on a standardized
test depends almost as much on what happens
outside of school as on school-learned
knowledges and skills.

Of the three Arithmetic Tests, Concepts draws
most heavily on the reasoning-type factors to
be found in Otis-Lennon, and this is the one
arithmetic area where she is above the
statewide random sample in the spring.

Conclusion

This child is not working up to capacity in
any test and clearly has what would amount to
a specific disability in arithmetic that very
often characterizes the very bright child to
whom over-learning of basics (e.g., 100
addition and subtraction facts and
multiplication tables) as a result of drill
and constant repetition is boring and
distasteful.

If this writer were dealing with this
particular child, his first step would be to
investigate the arithmetic area more closely
by identifying the pattern of answering the
items for each of the three Arithmetic Tests
to discover the kinds of mistakes the child is
making, and specifically to decide whether or
not these errors were largely due to probable
lack of mastery of the addition-subtraction
facts and of the multiplication tables, as
suggested.

Part III

Comparison of Title I Cases With the Random Sample on all Essential Variables

INTRODUCTION

From the beginning, we have emphasized the
fact that we have two populations being
treated identically so far as testing is
concerned, including the time of the
administration of the tests, the conditions
under which they are administered, and all
other similar variables.

This section of the study is similarly
organized. The analyses of data follow the
general format found in Part II. This study is
not to be confused with an earlier report
which also involved Title I children for the
whole state as well as the random sample of
the state. 1/

Our first conclusion must be that Title I
children are, indeed, different from the total
population (or from the random sample, already
shown to be representative), and this
difference runs through every test
administered and all subsequent analyses.

It would become rather boring and less
productive to make a routine comparison
exactly the same as was done for the random
sample, so we will concentrate on differences.

It is essential that a sufficient amount of
detailed comparison should be built into this
report to convince one that essential
generalizations change greatly when one moves
from a population such as the random sample to
the Title I group - other than the fact that
the Title I group performance drops on the
score scale.

The test in question may be a mental ability
test with a Deviation IQ, or it may be
Paragraph Meaning, or Arithmetic Computation,
or Science. In every case, there is a drop
from random sample to Title I.

In no case does the average of the Title I
group reach or exceed that of the random
sample, but in every case some of the children
in the Title I group do reach or exceed the
average score of children in the random
sample. In other words, there are overlapping
distributions.



1/ "A Description and Evaluation of the
   Statewide Testing Program in New Hampshire
   in 1968-69 and 1969-70 Under the
   Sponsorship of Title I and the Significance
   of the Data Obtained for Evaluation With
   This Activity." Prepared by the Test
   Service and Advisement Center. 1971.



No additional research was needed to reach
this conclusion; it is inevitable because of
the substantial variability of children's
ability and the spread of entrance age over at
least a year's span.

Some of the very bright children in the state
have been included among the Title I children
in this study, for reasons which cannot now be
ascertained because they are local and
expedient in nature.

We can only assume that (in part, at least)
the reason arises from the somewhat
unrealistic basis by which the law provides
for the selection of these individuals. Some
of the other more evidential reasons will be
discussed later.

The Original Title I Report 

It is pertinent to remind the reader 
that the basic score comparisons of the Ti-^ 
a tie I population and the r'andom sample, as 
v-well''as the state as a whole, were done in.. 
, grleat detail\in the first repbrt entitled: 
f A Description and Evaluation of the State- 
wide Testing Program in New Hampshire in 
1968-69 and 1969-70.," . ... ^ 

Much of the data in the first part of
this section comes directly out of the earlier
report, and it is highly recommended to
anyone who is making a careful study of this
report that he obtain the earlier report
first to provide the necessary background.

Section VI of the original report is
specifically concerned with the comparison
of the random sample with the total state
population versus the Title I group.

To save the time and bother of consulting
the earlier report, or because in some cases
it may be unavailable to the reader, we feel
impelled to repeat here some of the essential
findings:

1. The Title I sample available for
study cannot in any way be considered a random
sample from the entire state - nor even
of the group which normally would be considered
eligible for Title I assistance by
strict adherence to the law.

Some of the larger cities in the
state chose to go their own way so far as
evaluation was concerned, and there is nothing
in the national law to prevent their doing
so.

Hence, to some extent our Title I
population must be considered a biased sample
of all Title I cases in the state. In
all probability, the bias would not be in the
direction of increased ability level of the Title
I children tested in our group. It might
even have decreased it. Nobody knows for
sure.

We can be sure it was the Title I
sample included in this federally funded
program in New Hampshire. This is important
because it bears upon the extent to which
generalizations can be made to other Title I
programs in other states.

2. The composition of the random sample
group had approximately equal numbers of
boys and girls; on the other hand, the Title
I group is disproportionately boys, having
61% boys versus 39% girls.



One could consider this disproportionate
maleness to be a local bias if it
did not happen so frequently in so many
studies of the disadvantaged or handicapped,
delinquent, or poor-achieving child in
school.

(This writer did numerous studies,
for example, of delinquency and emotional
instability in the schools of Pinellas County,
Florida, and found repeatedly that about
two-thirds of these cases were boys. The
writer also was in direct charge of the
corrective reading program in the county, and
here, again, about two-thirds of the children
under instruction in the corrective
reading program were boys.)

The literature is full of comparisons
of this sort; therefore, one must assume
that whatever the basis is for choosing
the children for studies of this sort, it has
very generally resulted in about the same
disproportionate number of boys compared to
girls.

3. The Title I sample is older than
the random sample.

This follows from the arbitrary and
unreasonable entrance requirements held to
almost uniformly throughout the state - and
most other states, for that matter. Minimum
age for entrance into grade one in this
state varies, but generally children have to
be 6 years old not later than December 1.

The difficulties found by Title I
children quickly show up and result in
retardation unless the school system has an
ungraded primary system.

The Title I boys averaged 9 years
and 11 months of age; girls averaged 9 years
and 8 months of age - an interesting phenomenon.
The total population of Title I averaged
9 years and 10 months of age.



The median age at the beginning of
the fourth grade testing in October for the
state as a whole was about 9 years and 6
months.

By contrast, the random sample of
children chosen for our study was slightly
younger and brighter than average, being 9
years and 4 months of age; but this was a
factor over which the investigator had no
control.

The fact that this sample was 
tested in the spring for our convenience re- 
sulted in a substantial reduction in the 
number of cases identified by computer to be 
included in this study, as discussed in the 
original Title I report. 1/

Comparing totals only, it appears,
then, that there is a 6-month age difference
between the random sample (complete
cases only) and the Title I group.

4. 92% of the Title I children fell at 
or below the average random sample Deviation 
IQ of 102.

The Otis-Lennon Mental Ability 
Test: Elementary II Battery: Form J was ad- 
ministered statewide at the beginning of the 
test program (fall 1969). The Title I boys 
earned an average Deviation IQ of 85; the 
girls, 88; and the total was 86. The average
for the entire statewide population in
Grade 4 was 101.

At this point one must ask oneself
if it was the intent of the lawmakers who
framed Title I to provide a program for
slow-learning children - which, in effect,
it did.

The answer is emphatically, "No!";
it was to provide a special opportunity for
children coming from disadvantaged
backgrounds.

We have no way of knowing whether the
average mental ability (as measured) of the
parents of these children was lower than for
the population as a whole, or whether the
lower IQ's of the children in this program
were due to the disadvantages under which
they lived.

It is specious to say that a child
does better if he's under stimulating
circumstances at home (and/or in his general
environment) than he does if he's in a
restricted and impoverished environment.

Neither the disadvantaged alone
nor even the most fortunate people in the
state in regard to environment have any
monopoly on brightness. Very many of our
ablest people, in the history of this country
especially, have come from homes of
great poverty and hardship with very few
opportunities to "make something of themselves"
except as they went out and found
these opportunities on their own initiative.

1/ Ibid.

As new technologies develop in
this technological world, they're going to
develop because some people with creative
ideas, regardless of their backgrounds,
bring to them the dedication it takes to
stick to something until the job is done.



What Mental Ability Tests Are



Mental ability tests are nothing but a
series of tests to roughly sort out and
bring some order to the hierarchy of ability.
They consist of real-life problems,
generally not school-oriented, which are
stated in verbal terms but which require for
their solution a variety of skills.

Sometimes they involve knowledge of
vocabulary. Sometimes they involve solving
problems seemingly related to mathematics
and physics, in the type of thinking involved,
but stated in simple and untechnical
terms.

All of the problems involved in any
good mental ability test are oriented
specifically to the whole environment as the
source of knowledge. The greater harm, however,
lies in employing a test of this sort
with the disadvantaged child who, because of
his meager background, is unable to cope on
equal terms with someone no more alert or of
no greater mental ability than he.

The tests reflect, but do not measure,
the magnitude of the "disadvantage" as
regards school accomplishment.

We need to be sure, for example, to
choose a test such that nothing in the
test-taking experience adds to his difficulties.
Practice sheets, careful oral instruction,
time for questions before testing, etc., all
can help.

Test-taking skill can and should be
taught as a prerequisite of actual test
administration. However, by the beginning of
the fourth grade few pupils will not have
been subjected to objective testing in this
state.

Moreover, if one divests himself of the
idea that these tests measure native
intelligence or something that cannot be changed
by enriching and expanding horizons, most of
our "hangups" disappear. For all practical
purposes, tests of this sort reflect what a
child is able to do at a particular moment,
but not necessarily what he will be able to
do if he is given proper stimulation.

The sad part is not the instability of
mental ability measures, but their
consistency over a wide span of years.

Perhaps of greatest importance of all 
is the fact that the general mental ability 
test is the one test that correlates most 
highly with almost any other measured school- 
learned skill. This is not only true of 
reading and vocabulary, which are themselves 
saturated with language, but it is equally
true of mathematics - and particularly so of
concepts and applications.


It certainly is true of certain aspects
of science and social studies testing also,
especially in the middle and upper grades -
and especially in the more modern textbooks
where there is a diminution of emphasis on
knowledge of certain facts about history,
social studies, or science in favor of the
development of skills in adapting to new
facts (which are developing all the time)
and in the development of ability to find
and assess information of relevance to some
problem that needs to be solved at the
moment.

Finally, it is of greatest importance
to emphasize in this study that the Title I
children studied were not chosen on the basis
of any mental ability measure or even on
the basis of a systematic achievement test
program. All of these came after the fact,
so to speak.

Children had already been selected and
allocated to Title I projects before the
opening of school, so it remained to administer
the tests to these children, along
with all of the other children in the grades
involved (2, 4, 6, and 8 in the years
indicated), in October and to re-test the Title
I children and the random sample in the
following late April and/or early May.

Differentiation of Achievement Tests

Let us now take a brief look at the
achievement tests - namely, the Stanford
Achievement Test: Intermediate I Battery:
Form X - and try to make a judgment, as fairly
as possible, as to the extent to which the
content of these tests is biased in favor of
one socioeconomic group as compared to
another.



In doing this, it has to be remembered
that there is no single very prominent low
socioeconomic group in the State of New
Hampshire (as in the South or in our national
metropolitan areas). There may be something
of a bilingual problem in the northernmost
counties and in some of the southern
cities, but the proportion of bilingual
children is very small.

Looking first at the Word Meaning Test, 
it is very important to remember that these
words came from materials found appropriate
for the grade level in terms of the vocabulary
widely used in textbooks in this grade.
This was true at the time the words were
selected to be tested and the items written,
but the process of item analysis eliminated
words which were non-functioning for the
total group; i.e., words too simple or too
complex. 

There were, indeed, hard items and easy 
items left in the test; otherwise, the least 
able and the most able children in the group
tested would not have been able to make an
acceptable unbiased score.

Incidentally, as we get into this study 
we must conclude that the difference between 
a survey test, such as Stanford, and a test 
intended from the beginning to measure the 
before-after performance at a single grade
in a single state or community is very
great; indeed, it is much greater than any
of us realized, perhaps, until this study
was carried out.

Paragraph Meaning, as contrasted to
vocabulary development alone, does include the
development of certain specifically taught
skills - such as the ability to make a phonic
attack on new words; that is, to derive
the meaning of words that are new to the
child in their written form - at least in
the context of this test.

It only happens very rarely, and
particularly with children who are low in general
mental ability, that a word a child
might be expected to learn to read is not
already known to him when it is spoken. The
child's spoken vocabulary tremendously exceeds
his written or reading vocabulary at
the time he goes to school, and probably
through the lower elementary grades. For
many people, this remains true through
their whole lives.



This is the essence of reading instruc- 
tion in the lower grades; namely, learning 
the written symbol that stands for a particular
word we already know when spoken. Later,
the process may be reversed; we may
learn to speak and write words encountered
first in reading! This, however, occurs only
in the higher grades among children already
rated as good readers.

There have been, over the last couple
of decades, violent controversies as to
whether the look-say method (or the whole
method, as it is sometimes called) is better
than the phonic method.

There are arguments to be made for both
approaches, but probably the most unbiased
and uncommitted study done in this area
seems to indicate that method makes relatively
little difference, provided the
teacher adapts his or her instruction to the
need of the individual child. 1/

In Arithmetic Computation, as contrasted
to most other school subjects, there are
certain basic knowledges and skills which
have to be mastered, and a lack of mastery
of these skills constitutes a continuing
handicap throughout one's life.

For example, if a child does not know
his 100 addition and subtraction facts and
eventually his multiplication tables, he
will be handicapped constantly in doing other
kinds of arithmetic.

He may be able to think his way through
certain abstractions in advanced mathematics,
which really involves little manipulation
of numbers but rather encompasses
constellations of ideas concerning the
relationships between quantitative ideas.

Is all this repetitive? Well, perhaps
so! It will stand repeating. Some people
may read only Parts III and IV.



1/ Chall, Jeanne. Learning to Read: The
Great Debate. New York: McGraw-Hill,
1967.
THE TITLE I DATA COMPARED WITH RANDOM SAMPLE

Enough has been said in the previous
paragraphs to lead us directly into a comparison
of the data for Title I that is
strictly comparable in its nature to the data
previously presented for the random
sample.

In the random sample section of this 
report (Part II) we gave as histograms the 
actual distributions of Word Meaning in the 
fall and spring and also Arithmetic Computa- 
tion in the fall and spring. We will do the 
same thing for Title I.

A test which is used at the beginning
and again at the end of instruction, which
was true in this case, needs to be on the
hard side at the beginning in order to give
the pupils an opportunity for the maximum
amount of learning during the period of
instruction.

Coupled with this, of course, is the
corollary that the material that is not
known at the start is material to which the
child will be exposed during the course of
the year's instruction.

Therefore we have to be very careful to
be sure that the test does, indeed, measure
what the teacher intends to teach - not so
much specific-item-by-specific-item as in a
broad, general way; e.g., not so much the
meaning of "attachment" as the broad skills
in method of attack which will help the child
learn this word.

The Word Meaning Test is broadly based,
especially as to type; but not with the idea
that these identical problems define the
curriculum. The child's eventual vocabulary
is much larger than any curriculum in word
meaning.

The test sample is so small a sample of
the total vocabulary that neither this test
nor any other group of commonly used words
will, of a certainty, be found in the local
curriculum.

Word Meaning Content
Related to the Title I Score Distribution

These words represent a cross section
or random sample of the kinds of words children
at grade four are likely to encounter;
plus a good saturation of words that the
children should have been exposed to at
grade three and some harder ones to give
"top" to the test.

In Figure III-1, we give the Word Meaning
distribution for the fall. If one looks
back at the similar distribution for the
random sample (Figure II-1), it is easy to
see that this test was much harder for Title
I than it was for the random sample; and yet
the Title I fall distribution does have
cases earning scores as high as 32 out of a
38-item test.

The mean of 9.13 is substantially lower
than the random sample mean of 15.92, and
the random sample is considerably more variable
- as indicated by the comparative standard
deviations.

The point, however, is that the group
selected for Title I does distribute itself
across the continuum of vocabulary as measured
by Stanford.

The important thing to note regarding
the Title I fall distribution (Figure III-1)
is not so much the piling up at the lower
end, but the fact that the median (and modal)
value in this distribution is only a
score of 8!

The average chance score on this dis- 
tribution of four-choice items would be 9.5 
questions answered correctly out of the 38 
items, which is the number of items included 
in this test. The mean score of 9.13, as a 
matter of fact, is slightly below the chance 
level (9.5). 


However, when the children were retested
in the spring, the mean had moved up
to 13.2 (Figure III-2) and, although the
gain of 4 words (or points of score) between
October and May is certainly not anything to
be gleeful about, the test at this point
does not look much different than the kind
of distribution we very often get with a
survey-type standardized achievement test
with similar groups.

Obviously, all those who earned scores
below 9 or 10 did not do so by chance alone;
the problem is (and always has been): How
many correct responses were obtained by
guessing?

We do not give nearly enough emphasis
to the fact that in tests of this sort there
are large numbers of children who are clearly
working so far below their grade level
that they simply do not have the opportunity
to progress very far above the guessing level
during the relatively short instructional
period of time involved (seven months).

If vocabulary building in the local
situation is not specifically a goal of
instruction but is left to incidental learning
in connection with all instruction, no
standardized vocabulary test is curriculum valid
(in the strictest sense) at the local level.
Not even a locally-made test would be valid,
FIGURE III-1
Frequency Distribution, Cumulative Percent Distribution, and Stanines
Plus Histogram Showing Shape of Raw Score Distribution Graphically

TITLE I - WORD MEANING - FALL 1969
Mean - 9.13     St. Dev.
FIGURE III-2

Frequency Distribution, Cumulative Percent Distribution, and Stanines
Plus Histogram Showing Shape of Raw Score Distribution Graphically

TITLE I - WORD MEANING - SPRING 1970
Mean - 13.21     St. Dev. -
ment scale than it is to measur
the top, since those at the top
by substantial margins the aver
lary performance of children at
due to general environmental fa
as instruction.
The Average Chance Score

We also see that 261 children in this
group of 432 Title I children taking the
Word Meaning Test in the fall (Figure III-1)
achieved scores of 9 or lower, which means
that they scored essentially at the chance
level. In other words, if these children
had simply marked the paper without ever
looking at it, they would have a 50/50
chance of getting as high a score as they
earned.

The average chance score depends on the
number of alternative choices provided and
the number of items. It is a fraction with
"1" as the numerator and the number of
alternatives as the denominator times "n"
items. Thus, for a four-choice test it is
1/4 n, where "n" is the number of items in the
test.
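
The rule just described can be sketched in a few lines of Python. This is only an illustration, assuming that every item is marked blindly and that each alternative is equally likely to be chosen; the function name average_chance_score is hypothetical and does not come from the study. The two calls use the item counts reported in this section for the Word Meaning Test (38 four-choice items) and the Arithmetic Computation Test (39 five-choice items).

```python
from fractions import Fraction

def average_chance_score(n_items, n_choices):
    """Expected number right when every item is marked blindly.

    Assumes all n_items are marked and each of the n_choices
    alternatives is equally likely (an illustrative assumption).
    """
    return Fraction(1, n_choices) * n_items

# Word Meaning: 38 four-choice items
word_meaning = average_chance_score(38, 4)
# Arithmetic Computation: 39 five-choice items
arithmetic = average_chance_score(39, 5)

print(float(word_meaning))  # 9.5
print(float(arithmetic))    # 7.8
```

The second value rounds to the "8 items right in round numbers" quoted for Arithmetic Computation.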



Arithmetic Computation Distribution 
Characteristics

Turning now to the Arithmetic Computation
Test, it is evident that this test is
also too hard. It is too hard even for the
random sample. It is a 39-item test, but
the highest score obtained by anyone is only
29 in the fall random sample group.

Arithmetic Computation is a five-choice
item test; and therefore by chance, on the
average, marking the answer sheet without
reference to the test would give an individual
a score of 1/5 of the total number of
items (39), or 8 items right in round
numbers.



(Eight is the average of a normal distribution
of errors for 39 five-choice items,
but the standard deviation of this distribution,
which cannot be exactly obtained by
any simple method, is probably about 4 or 5
points.)



In the random sample in the fall program,
26% of the children had scores that
could have been obtained as frequently as
not by chance, assuming all 39 items had
been marked. This dropped to 8% in the
spring.

By and large a very substantial majority
of children answer the questions they
know and omit many of the remaining items;
these they do not mark by chance, obviously.
This is as it should be.

This study is most revealing in showing
that the proportion of those who do use
chance marking is substantially greater than
we had suspected it might be. We must ask
ourselves very seriously if this is a tolerable
situation.



But what about the wrong responses?
What proportion of wrongs to rights is
acceptable? In a work-sample type item, where
there is little or no chance of guessing,
the proportion would be zero!

The situation naturally is worse in the
case of Title I, with 41% of the children
achieving scores at the average chance level
or below in the fall. (See Figure III-3.)
This means, of course, that there will be a
roughly equal number of others who most
likely have earned higher scores by chance;
they were among the lucky ones in their
choice of correct answers, if you take their
point of view.



All this theory applies only when all
the items have been marked. For a child who
attempts 29 of a 39-item test, 29 (not 39)
is the effective test length for that child
and the chance situation is changed. This
is the fallacy of the traditional correction
for chance.
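
The effect of the shorter effective test length can be illustrated with a small Python sketch. It assumes, purely for illustration, that the attempted items are marked blindly with equally likely alternatives; the function name chance_expectation is hypothetical and is not taken from the study.

```python
def chance_expectation(items_attempted, n_choices):
    """Expected rights by blind marking on the items actually attempted.

    Illustrative assumption: every attempted item is a pure guess
    among n_choices equally likely alternatives.
    """
    return items_attempted / n_choices

# Five-choice Arithmetic Computation test
all_marked = chance_expectation(39, 5)    # child marks all 39 items
attempted_29 = chance_expectation(29, 5)  # child marks only 29 items

print(all_marked)    # 7.8
print(attempted_29)  # 5.8
```

Under these assumptions the chance expectation for the child who attempts 29 items is about two points lower than the figure computed from the full 39 items.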

Remembering now that we have concluded
prior to this that a difficult test is desirable
at the beginning of the instructional
period, we note that the mean for the
random sample was 11.46 in the fall but that
this jumped to 18.34 in the spring, or a total
of about 7 points during the course of
the seven-month period between fall and
spring testing.
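
The random sample gain quoted above can be checked with a short Python sketch. The two means are taken from the text; the per-month figure is simply the gain divided by the seven-month interval and is an illustrative calculation, not a statistic reported in the study. The function name mean_gain is hypothetical.

```python
from decimal import Decimal

def mean_gain(fall_mean, spring_mean):
    """Difference between spring and fall mean raw scores."""
    return Decimal(spring_mean) - Decimal(fall_mean)

# Random sample, Arithmetic Computation (means quoted in the text)
gain = mean_gain("11.46", "18.34")
months = 7  # fall-to-spring interval stated in the text

print(gain)                     # 6.88
print(round(gain / months, 2))  # 0.98
```

Decimal arithmetic is used here only so that the printed difference matches the two-decimal means exactly.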



For the Title I group, the gain is only
4½ points, but (in view of the fact that
this is a much less able group) this is a
notable gain by comparison. (See Figure
III-4.) There still remains about 17% of
the group, even in the spring, who are at
the average chance level or below, all other
previously stated conditions applying.
FIGURE III-3

Frequency Distribution, Cumulative Percent Distribution, and Stanines
Plus Histogram Showing Shape of Raw Score Distribution Graphically

TITLE I - ARITHMETIC COMPUTATION - FALL 1969
Mean - 9.94     St. Dev. - 4.03
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